Abstract

We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation 
driven by ‘controlled’ Markov noise. In particular, both the faster and slower recursions have non-additive controlled
Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior
of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of 
the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution 
to the off-policy convergence problem for temporal difference learning with linear function approximation, using 
our results. 


1 Introduction 

Stochastic approximation algorithms are sequential non-parametric methods for finding a zero or minimum of a function when only noisy observations of the function values are available. Two time-scale stochastic approximation algorithms represent one of the most general subclasses of stochastic approximation methods. These algorithms consist of two coupled recursions that are updated with different step sizes (one considerably smaller than the other), which in turn facilitates convergence.

Two time-scale stochastic approximation algorithms have been applied successfully to several complex problems arising in the areas of reinforcement learning, signal processing and admission control in communication networks. There are many reinforcement learning applications (precisely those where a parameterization of the value function is implemented) where non-additive Markov noise is present in one or both iterates, thus requiring the current two time-scale framework to be extended to include Markov noise (for example, in [13, p. 5] it is mentioned that in order to generalize the analysis to Markov noise, the theory of two time-scale stochastic approximation needs to include the latter).

Here we present a more general framework of two time-scale stochastic approximation with “controlled” Markov 
noise, i.e., the noise is not simply Markov; rather it is driven by the iterates and an additional control process as 
well. We analyze the asymptotic behaviour of our framework by relating it to limiting differential inclusions in 
both timescales that are defined in terms of the ergodic occupation measures associated with the controlled Markov 
processes. Next, using these results for the special case of our framework where the random processes are irreducible 
Markov chains, we present a solution to the off-policy convergence problem for temporal difference learning with 
linear function approximation. While the off-policy convergence problem for reinforcement learning (RL) with linear function approximation has been one of the most interesting problems, there are very few solutions available in the current literature. One such work [7] shows the convergence of the least squares temporal difference learning algorithm with eligibility traces (LSTD($\lambda$)) as well as the TD($\lambda$) algorithm. While the LSTD methods are not feasible when the dimension of the feature vector is large, off-policy TD($\lambda$) is shown to converge only when the eligibility function $\lambda \in [0,1]$ is very close to 1. Another recent work proves weak convergence of several emphatic temporal difference learning algorithms, which are also designed to solve the off-policy convergence problem. The gradient temporal difference learning (GTD) algorithms were also proposed to solve this problem. However, the authors make the assumption that the data is available in the “off-policy” setting (i.e., the off-policy issue is incorporated into the data rather than in the algorithm) whereas, in reality, one has only the “on-policy” Markov trajectory corresponding to a given behaviour policy and we are interested in designing an online learning algorithm. We use one of the algorithms from [3] called TDC with “importance-weighting”, which takes the “on-policy” data as input, and show its convergence using the results we develop. Our convergence analysis can also be extended to the same algorithm with eligibility traces for a sufficiently large range of values of $\lambda$. Our results can also be used to provide a convergence analysis for reinforcement learning algorithms for which convergence proofs have not been provided.

To the best of our knowledge, there are related works in which two time-scale stochastic approximation algorithms with algorithm iterate dependent non-additive Markov noise are analyzed. In all of them the Markov noise in the recursion is handled using the classic Poisson equation based approach and applied to the asymptotic analysis of many algorithms used in machine learning, system identification, signal processing, image analysis and automatic control. However, we show that our method also works if there is an additional control process and if the underlying Markov process has non-unique stationary distributions. Further, the mentioned application does not require strong assumptions such as aperiodicity of the underlying Markov chain, which is a sufficient condition if we use the Poisson equation based approach. Additionally, our assumptions are quite different from the assumptions made in the mentioned literature, and we give a detailed comparison later in the paper.

The organization of the paper is as follows: Section 2 formally defines the problem and provides background and assumptions. Section 3 shows the main results. Section 4 discusses how one of our assumptions of Section 2 can be relaxed. Section 5 presents an application of our results to the off-policy convergence problem for temporal difference learning with linear function approximation. Finally, we conclude by providing some future research directions.


2 Background, Problem Definition, and Assumptions 

In the following we describe the preliminaries and notation used in our proofs. Most of the definitions and notation 
are from 01111 [7]. 

2.1 Definition and Notation 

Let $F$ denote a set-valued function mapping each point $\theta \in \mathbb{R}^m$ to a set $F(\theta) \subset \mathbb{R}^m$. $F$ is called a Marchaud map if the following hold:

(i) $F$ is upper-semicontinuous in the sense that if $\theta_n \to \theta$ and $w_n \to w$ with $w_n \in F(\theta_n)$ for all $n \geq 1$, then $w \in F(\theta)$. In other words, the graph of $F$ defined as $\{(\theta, w) : w \in F(\theta)\}$ is closed.

(ii) $F(\theta)$ is a non-empty compact convex subset of $\mathbb{R}^m$ for all $\theta \in \mathbb{R}^m$.

(iii) $\exists c > 0$ such that for all $\theta \in \mathbb{R}^m$,

$$\sup_{z \in F(\theta)} \|z\| \leq c(1 + \|\theta\|),$$

where $\|\cdot\|$ denotes any norm on $\mathbb{R}^m$.
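As a small illustration of condition (iii), the sketch below checks the linear growth bound numerically for a hypothetical interval-valued map on the real line, $F(\theta) = [-\theta - 1, -\theta + 1]$. This map is not taken from the paper; it is chosen only because its values are non-empty compact convex sets and its graph is closed, so it is a natural candidate for a Marchaud map with $c = 1$.

```python
# Hypothetical example (not from the paper): F(theta) = [-theta - 1, -theta + 1].
# Each value is a closed bounded interval, hence non-empty, compact and convex.

def F(theta):
    """Return the interval F(theta) as a (lower, upper) pair."""
    return (-theta - 1.0, -theta + 1.0)


def sup_norm(interval):
    """sup{|z| : z in [lo, hi]} is attained at one of the end points."""
    lo, hi = interval
    return max(abs(lo), abs(hi))


def satisfies_growth_bound(thetas, c=1.0):
    """Check sup_{z in F(theta)} |z| <= c (1 + |theta|) on a finite grid."""
    # A tiny slack absorbs floating-point rounding.
    return all(sup_norm(F(th)) <= c * (1.0 + abs(th)) + 1e-12 for th in thetas)


grid = [k / 10.0 for k in range(-500, 501)]
print(satisfies_growth_bound(grid, c=1.0))
```

A grid check of this kind is of course not a proof; here the bound holds exactly because $\sup_{z \in F(\theta)} |z| = 1 + |\theta|$.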

A solution for the differential inclusion (d.i.)

$$\dot{\theta}(t) \in F(\theta(t)) \tag{1}$$

with initial point $\theta_0 \in \mathbb{R}^m$ is an absolutely continuous (on compacts) mapping $\theta : \mathbb{R} \to \mathbb{R}^m$ such that $\theta(0) = \theta_0$ and

$$\dot{\theta}(t) \in F(\theta(t))$$

for almost every $t \in \mathbb{R}$. If $F$ is a Marchaud map, it is well-known that (1) has solutions (possibly non-unique) through every initial point. The differential inclusion (1) induces a set-valued dynamical system $\{\Phi_t\}_{t \in \mathbb{R}}$ defined by

$$\Phi_t(\theta_0) = \{\theta(t) : \theta(\cdot) \text{ is a solution to (1) with } \theta(0) = \theta_0\}.$$

Consider the autonomous ordinary differential equation (o.d.e.)

$$\dot{\theta}(t) = h(\theta(t)), \tag{2}$$

where $h$ is Lipschitz continuous. One can write (2) in the format of (1) by taking $F(\theta) = \{h(\theta)\}$. It is well-known that (2) is well-posed, i.e., it has a unique solution for every initial point. Hence the set-valued dynamical system induced by the o.d.e., or flow, is $\{\Phi_t\}_{t \in \mathbb{R}}$ with

$$\Phi_t(\theta_0) = \{\theta(t)\},$$

where $\theta(\cdot)$ is the solution to (2) with $\theta(0) = \theta_0$. It is also well-known that $\Phi_t(\cdot)$ is a continuous function for all $t \in \mathbb{R}$.
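The flow of a well-posed o.d.e. can be approximated numerically. The following is a minimal sketch, assuming the hypothetical scalar vector field $h(\theta) = -\theta$ (not from the paper) and a plain forward Euler scheme; it approximates $\Phi_t(\theta_0)$ and illustrates the semigroup property $\Phi_{t+s} = \Phi_t \circ \Phi_s$.

```python
import math


def h(theta):
    """Hypothetical Lipschitz vector field; the exact flow is theta0 * exp(-t)."""
    return -theta


def flow(theta0, t, dt=1e-4):
    """Forward Euler approximation of the single point in Phi_t(theta0)."""
    steps = int(round(t / dt))
    theta = theta0
    for _ in range(steps):
        theta += dt * h(theta)
    return theta


approx = flow(1.0, 1.0)
exact = math.exp(-1.0)
print(approx, exact)

# Semigroup property: flowing for 0.5 and then 0.5 again matches flowing for 1.0.
print(abs(flow(flow(1.0, 0.5), 0.5) - approx))
```

The step size `dt` and the Euler scheme are illustrative choices; any consistent integrator would serve the same purpose.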

A set $A \subset \mathbb{R}^m$ is said to be invariant (for $F$) if for all $\theta_0 \in A$ there exists a solution $\theta(\cdot)$ of (1) with $\theta(0) = \theta_0$ such that $\theta(\mathbb{R}) \subset A$.

Given a set $A \subset \mathbb{R}^m$ and $\theta'', w'' \in A$, we write $\theta'' \hookrightarrow_A w''$ if for every $\epsilon > 0$ and $T > 0$ $\exists n \in \mathbb{N}$, solutions $\theta_1(\cdot), \ldots, \theta_n(\cdot)$ to (1) and real numbers $t_1, t_2, \ldots, t_n$ greater than $T$ such that

(i) $\theta_i(s) \in A$ for all $0 \leq s \leq t_i$ and for all $i = 1, \ldots, n$,

(ii) $\|\theta_i(t_i) - \theta_{i+1}(0)\| \leq \epsilon$ for all $i = 1, \ldots, n-1$,

(iii) $\|\theta_1(0) - \theta''\| \leq \epsilon$ and $\|\theta_n(t_n) - w''\| \leq \epsilon$.

The sequence $(\theta_1(\cdot), \ldots, \theta_n(\cdot))$ is called an $(\epsilon, T)$ chain (in $A$ from $\theta''$ to $w''$) for $F$. A set $A \subset \mathbb{R}^m$ is said to be internally chain transitive provided that $A$ is compact and $\theta'' \hookrightarrow_A w''$ for all $\theta'', w'' \in A$. It can be proved that in the above case, $A$ is an invariant set.

A compact invariant set $A$ is called an attractor for $\Phi$, provided that there is a neighbourhood $U$ of $A$ (i.e., for the induced topology) with the property that $d(\Phi_t(\theta''), A) \to 0$ as $t \to \infty$ uniformly in $\theta'' \in U$. Here $d(X, Y) = \sup_{\theta'' \in X} \inf_{w'' \in Y} \|\theta'' - w''\|$ for $X, Y \subset \mathbb{R}^m$. Such a $U$ is called a fundamental neighbourhood of the attractor $A$. An attractor of a well-posed o.d.e. is an attractor for the set-valued dynamical system induced by the o.d.e.

The set

$$\omega_\Phi(\theta'') = \bigcap_{t \geq 0} \overline{\Phi_{[t,\infty)}(\theta'')}$$

is called the $\omega$-limit set of a point $\theta'' \in \mathbb{R}^m$. If $A$ is a set, then

$$B(A) = \{\theta'' \in \mathbb{R}^m : \omega_\Phi(\theta'') \subset A\}$$

denotes its basin of attraction. A global attractor for $\Phi$ is an attractor $A$ whose basin of attraction consists of all $\mathbb{R}^m$. Then the following lemma will be useful for our proofs, see [5] for a proof.

Lemma 2.1. Suppose $\Phi$ has a global attractor $A$. Then every internally chain transitive set lies in $A$.

We also require another result which will be useful to apply our results to the RL application we mention. Before stating it we recall some definitions from Appendix 11.2.3 of [21]:

A point $\theta^* \in \mathbb{R}^m$ is called Lyapunov stable for the o.d.e. (2) if for all $\epsilon > 0$, there exists a $\delta > 0$ such that every trajectory of (2) initiated in the $\delta$-neighbourhood of $\theta^*$ remains in its $\epsilon$-neighbourhood. $\theta^*$ is called globally asymptotically stable if $\theta^*$ is Lyapunov stable and all trajectories of the o.d.e. converge to it.

Lemma 2.2. Consider the autonomous o.d.e. $\dot{\theta}(t) = h(\theta(t))$ where $h$ is Lipschitz continuous. Let $\theta^*$ be globally asymptotically stable. Then $\theta^*$ is the global attractor of the o.d.e.

Proof 1. We refer the readers to Lemma 1 of [21, Chapter 3] for a proof.

We end this subsection with a notation which will be used frequently in the convergence statements in the 
following sections. 

Definition 2.1. For a function $\theta(\cdot)$ defined on $[0, \infty)$, the notation “$\theta(t) \to A$ as $t \to \infty$” means that $\bigcap_{t \geq 0} \overline{\{\theta(s) : s \geq t\}} \subset A$. A similar definition applies for a sequence $\{\theta_n\}$.


2.2 Problem Definition 


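Our goal is to perform an asymptotic analysis of the following coupled recursions:

$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, w_n, Z_n^{(1)}) + M_{n+1}^{(1)}\right], \tag{3}$$

$$w_{n+1} = w_n + b(n)\left[g(\theta_n, w_n, Z_n^{(2)}) + M_{n+1}^{(2)}\right], \tag{4}$$

where $\theta_n \in \mathbb{R}^d$, $w_n \in \mathbb{R}^k$, $n \geq 0$ and $\{Z_n^{(i)}\}, i = 1,2$ are random processes that we describe below.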
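To make the structure of the coupled recursions concrete, here is a minimal simulation sketch. It is a deliberately simplified toy, not the paper's setting: the Markov noise components are dropped, the drift functions `h` and `g` are hypothetical linear maps, the martingale difference noise is i.i.d. Gaussian, and the step sizes $a(n) = 1/(n+1)$, $b(n) = 1/(n+1)^{2/3}$ are one assumed choice with $a(n)/b(n) \to 0$. For fixed $\theta$ the fast iterate is driven towards $2\theta$, and the slow iterate then sees the averaged drift $2 - 2\theta$, so the pair should approach $(1, 2)$.

```python
import random


def a(n):
    """Slower step size (assumed choice)."""
    return 1.0 / (n + 1)


def b(n):
    """Faster step size (assumed choice); a(n) / b(n) -> 0."""
    return 1.0 / (n + 1) ** (2.0 / 3.0)


def h(theta, w):
    """Hypothetical slow drift, with no Markov noise argument."""
    return 2.0 - w


def g(theta, w):
    """Hypothetical fast drift; for fixed theta its o.d.e. converges to 2 * theta."""
    return 2.0 * theta - w


def two_timescale(num_iters, seed=0, noise_std=0.1):
    """Run the toy coupled recursions and return the final (theta, w)."""
    rng = random.Random(seed)
    theta, w = 0.0, 0.0
    for n in range(num_iters):
        m1 = rng.gauss(0.0, noise_std)  # martingale difference noise, slow iterate
        m2 = rng.gauss(0.0, noise_std)  # martingale difference noise, fast iterate
        theta_next = theta + a(n) * (h(theta, w) + m1)
        w_next = w + b(n) * (g(theta, w) + m2)
        theta, w = theta_next, w_next
    return theta, w


theta_final, w_final = two_timescale(50000)
print(theta_final, w_final)
```

Both iterates are updated from the same pair $(\theta_n, w_n)$, mirroring the simultaneous form of the recursions; the time-scale separation comes only from the step sizes.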

We make the following assumptions: 

(A1) $\{Z_n^{(i)}\}$ takes values in a compact metric space $S^{(i)}, i = 1,2$. Additionally, the processes $\{Z_n^{(i)}\}, i = 1,2$ are controlled Markov processes that are controlled by three different control processes: the iterate sequences $\{\theta_m\}$, $\{w_m\}$ and a random process $\{A_n^{(i)}\}$ taking values in a compact metric space $U^{(i)}$ respectively, with their individual dynamics specified by

$$P(Z_{n+1}^{(i)} \in B^{(i)} \mid Z_m^{(i)}, A_m^{(i)}, \theta_m, w_m, m \leq n) = \int_{B^{(i)}} p^{(i)}(dy \mid Z_n^{(i)}, A_n^{(i)}, \theta_n, w_n), \quad n \geq 0,$$

for $B^{(i)}$ Borel in $S^{(i)}, i = 1,2$, respectively.

Remark 1. In this context one should note that uw require the Markov process to take values in a normed 
Polish space. 
Remark 2. In [20] it is assumed that the state space where the controlled Markov process takes values is Polish. This space is then compactified using the fact that a Polish space can be homeomorphically embedded into a dense subset of a compact metric space. The vector field $h(\cdot,\cdot) : \mathbb{R}^d \times S \to \mathbb{R}^d$ is considered bounded when the first component lies in a compact set. This would, however, require a continuous extension of $h' : \mathbb{R}^d \times \phi(S) \to \mathbb{R}^d$ defined by $h'(x, s') = h(x, \phi^{-1}(s'))$ to $\mathbb{R}^d \times \overline{\phi(S)}$. Here $\phi(\cdot)$ is the homeomorphism defined by $\phi(s) = (\rho(s, s_1), \rho(s, s_2), \ldots) \in [0,1]^\infty$, where $\{s_i\}$ and $\rho$ are a countable dense subset and the metric of the Polish space respectively. A sufficient condition for the above is that $h'$ be uniformly continuous [22, Ex: 13, p. 99]. However, this is hard to verify. This is the main motivation for us to take the range of the Markov process as compact for our problem. However, there are other reasons for taking a compact state space which will be clear in the proofs of this section and the next.


(A2) $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments uniformly w.r.t. the third. The latter condition means that

$$\forall z^{(1)} \in S^{(1)}, \quad \|h(\theta, w, z^{(1)}) - h(\theta', w', z^{(1)})\| \leq L^{(1)}(\|\theta - \theta'\| + \|w - w'\|).$$

The same is also true for $g$, where the Lipschitz constant is $L^{(2)}$. Note that the Lipschitz constant $L^{(i)}$ does not depend on $z^{(i)}$ for $i = 1, 2$.

Remark 3. We later relax the uniformity of the Lipschitz constant w.r.t the Markov process state space by 
putting suitable moment assumptions on the Markov process. 


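(A3) $\{M_n^{(i)}\}, i = 1,2$ are martingale difference sequences w.r.t. increasing $\sigma$-fields

$$\mathcal{F}_n = \sigma(\theta_m, w_m, M_m^{(i)}, Z_m^{(i)}, m \leq n, i = 1,2), \quad n \geq 0,$$

satisfying

$$E[\|M_{n+1}^{(i)}\|^2 \mid \mathcal{F}_n] \leq K(1 + \|\theta_n\|^2 + \|w_n\|^2), \quad i = 1,2,$$

for $n \geq 0$ and a given constant $K > 0$.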

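(A4) The stepsizes $\{a(n)\}, \{b(n)\}$ are positive scalars satisfying

$$\sum_n a(n) = \sum_n b(n) = \infty, \quad \sum_n (a(n)^2 + b(n)^2) < \infty, \quad \frac{a(n)}{b(n)} \to 0.$$

Moreover, $a(n), b(n), n \geq 0$ are non-increasing.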

Before stating the assumption on the transition kernel $p^{(i)}, i = 1, 2$ we need to define the metric in the space of probability measures $\mathcal{P}(S)$. Here we mention the definitions and main theorems on the spaces of probability measures that we use in our proofs (details can be found in Chapter 2 of [H]). We denote the metric by $d$; it is defined as

$$d(\mu, \nu) = \sum_j 2^{-j} \left| \int f_j \, d\mu - \int f_j \, d\nu \right|, \quad \mu, \nu \in \mathcal{P}(S),$$

where $\{f_j\}$ are countable dense in the unit ball of $C(S)$. Then the following are equivalent:

(i) $d(\mu_n, \mu) \to 0$,

(ii) for all bounded $f$ in $C(S)$,

$$\int_S f \, d\mu_n \to \int_S f \, d\mu, \tag{5}$$

(iii) for all $f$ bounded and uniformly continuous,

$$\int_S f \, d\mu_n \to \int_S f \, d\mu.$$

Hence we see that $d(\mu_n, \mu) \to 0$ iff $\int_S f_j \, d\mu_n \to \int_S f_j \, d\mu$ for all $j$. Any such sequence of functions $\{f_j\}$ is called a convergence determining class in $\mathcal{P}(S)$. Sometimes we also denote $d(\mu_n, \mu) \to 0$ using the notation $\mu_n \Rightarrow \mu$. Also, we recall the characterization of relative compactness in $\mathcal{P}(S)$ that relies on the definition of tightness. $A \subset \mathcal{P}(S)$ is a tight set if for any $\epsilon > 0$, there exists a compact $K_\epsilon \subset S$ such that $\mu(K_\epsilon) > 1 - \epsilon$ for all $\mu \in A$. Clearly, if $S$ is compact then any $A \subset \mathcal{P}(S)$ is tight. By Prohorov’s theorem, $A \subset \mathcal{P}(S)$ is relatively compact if and only if it is tight.

With the above definitions we assume the following: 
(A5) The map $S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \ni (z^{(i)}, a^{(i)}, \theta, w) \to p^{(i)}(dy \mid z^{(i)}, a^{(i)}, \theta, w) \in \mathcal{P}(S^{(i)})$ is continuous.

Remark 4. (A5) is much simpler than the assumptions on n-step transition kernel in ^ Part II,Chap. 2, 
Theorem 6]. 


Additionally, unlike [20, p. 140, line 13], we do not require the extra assumption that the continuity in the $\theta$ variable of $p(dy \mid z, a, \theta)$ be uniform on compacts w.r.t. the other variables.
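The metric $d$ on the space of probability measures introduced before (A5) can be illustrated on a finite state space. The sketch below is a simplification under stated assumptions: the state space is $\{0, \ldots, k-1\}$, measures are lists of probabilities, and the countable dense family $\{f_j\}$ is replaced by the finite family of indicator functions of single states, which already separates measures on a finite set. The function names are hypothetical.

```python
def integral(f, mu):
    """Integrate f against the probability vector mu on {0, ..., k-1}."""
    return sum(f(s) * p for s, p in enumerate(mu))


def make_test_functions(num_states):
    """Indicator functions of single states (an assumed, finite stand-in for {f_j})."""
    return [lambda s, j=j: 1.0 if s == j else 0.0 for j in range(num_states)]


def d(mu, nu, test_functions):
    """d(mu, nu) = sum_j 2^{-j} |int f_j dmu - int f_j dnu|, with j starting at 1."""
    return sum(
        2.0 ** -(j + 1) * abs(integral(f, mu) - integral(f, nu))
        for j, f in enumerate(test_functions)
    )


mu = [0.2, 0.3, 0.5]
nu = [0.6, 0.1, 0.3]
fs = make_test_functions(3)

# mu_n = (1 - 1/n) mu + (1/n) nu converges weakly to mu, so d(mu_n, mu) shrinks.
for n in (1, 10, 100, 1000):
    mu_n = [(1.0 - 1.0 / n) * p + (1.0 / n) * q for p, q in zip(mu, nu)]
    print(n, d(mu_n, mu, fs))
```

Since the integrals are linear in the measure, $d(\mu_n, \mu) = d(\nu, \mu)/n$ in this example, which makes the convergence easy to check by hand.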

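For $\theta_n = \theta$, $w_n = w$ for all $n$ with a fixed deterministic $(\theta, w) \in \mathbb{R}^{d+k}$ and under any stationary randomized control $\pi^{(i)}$, it follows from Lemma 2.1 and Lemma 3.1 of [50] that the time-homogeneous Markov processes $Z_n^{(i)}, i = 1,2$ have (possibly non-unique) invariant distributions $\eta^{(i)}_{\theta,w,\pi^{(i)}}, i = 1,2$.

Now, it is well-known that the ergodic occupation measure defined as

$$\Psi^{(i)}_{\theta,w,\pi^{(i)}}(dz, da) := \eta^{(i)}_{\theta,w,\pi^{(i)}}(dz)\,\pi^{(i)}(z, da) \in \mathcal{P}(S^{(i)} \times U^{(i)})$$

satisfies the following:

$$\int_{S^{(i)}} f^{(i)}(z)\,\Psi^{(i)}_{\theta,w,\pi^{(i)}}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z, a, \theta, w)\,\Psi^{(i)}_{\theta,w,\pi^{(i)}}(dz, da) \tag{6}$$

for $f^{(i)} : S^{(i)} \to \mathbb{R}$, $f^{(i)} \in C_b(S^{(i)})$.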
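The invariance identity for ergodic occupation measures can be checked directly on a finite example. The following is a sketch under assumptions that are not from the paper: two states, two actions, a fixed $(\theta, w)$ absorbed into a hypothetical kernel `p[z][a][y]`, and a hypothetical stationary randomized control `pi[z][a]`. The invariant distribution (called `eta` here) is obtained by power iteration, and the occupation measure is formed as `psi[z][a] = eta[z] * pi[z][a]`.

```python
# Hypothetical controlled kernel p[z][a][y] = P(next state y | state z, action a).
p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
]
# Hypothetical stationary randomized control pi[z][a].
pi = [[0.5, 0.5], [0.25, 0.75]]


def closed_loop_kernel(p, pi):
    """P_pi[z][y] = sum_a pi[z][a] * p[z][a][y]."""
    k, m = len(p), len(pi[0])
    return [[sum(pi[z][a] * p[z][a][y] for a in range(m)) for y in range(k)]
            for z in range(k)]


def stationary_distribution(P, iters=500):
    """Power iteration eta <- eta P, started from the uniform distribution."""
    k = len(P)
    eta = [1.0 / k] * k
    for _ in range(iters):
        eta = [sum(eta[z] * P[z][y] for z in range(k)) for y in range(k)]
    return eta


def occupation_measure(eta, pi):
    """psi[z][a] = eta[z] * pi[z][a]."""
    return [[eta[z] * pi[z][a] for a in range(len(pi[z]))] for z in range(len(eta))]


def invariance_sides(f, p, psi):
    """Left and right sides of the invariance identity for a test function f."""
    k, m = len(psi), len(psi[0])
    lhs = sum(f[z] * sum(psi[z]) for z in range(k))
    rhs = sum(psi[z][a] * sum(f[y] * p[z][a][y] for y in range(k))
              for z in range(k) for a in range(m))
    return lhs, rhs


eta = stationary_distribution(closed_loop_kernel(p, pi))
psi = occupation_measure(eta, pi)
print(invariance_sides([1.0, -2.0], p, psi))
```

Power iteration converges here because every entry of the closed-loop kernel is positive; a chain with several invariant distributions, which the paper allows, would need a different method to enumerate them.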

We denote by $D^{(i)}(\theta, w), i = 1,2$ the set of all such ergodic occupation measures for the prescribed $\theta$ and $w$. In the following we prove some properties of the map $(\theta, w) \to D^{(i)}(\theta, w)$.

Lemma 2.3. For all $(\theta, w)$, $D^{(i)}(\theta, w)$ is convex and compact.

Proof 2. The proof trivially follows from (A1), (A5) and (6).

Lemma 2.4. The map $(\theta, w) \to D^{(i)}(\theta, w)$ is upper-semi-continuous.

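Proof 3. Let $\theta_n \to \theta$, $w_n \to w$ and $\Psi_n^{(i)} \Rightarrow \Psi^{(i)} \in \mathcal{P}(S^{(i)} \times U^{(i)})$ such that $\Psi_n^{(i)} \in D^{(i)}(\theta_n, w_n)$. Let $g_n^{(i)}(z, a) = \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z, a, \theta_n, w_n)$ and $g^{(i)}(z, a) = \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z, a, \theta, w)$. From (6) we get that

$$\int_{S^{(i)}} f^{(i)}(z)\,\Psi^{(i)}(dz, U^{(i)}) = \lim_{n \to \infty} \int_{S^{(i)}} f^{(i)}(z)\,\Psi_n^{(i)}(dz, U^{(i)})$$

$$= \lim_{n \to \infty} \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z, a, \theta_n, w_n)\,\Psi_n^{(i)}(dz, da)$$

$$= \lim_{n \to \infty} \int_{S^{(i)} \times U^{(i)}} g_n^{(i)}(z, a)\,\Psi_n^{(i)}(dz, da).$$

Now, $p^{(i)}(dy \mid z, a, \theta_n, w_n) \Rightarrow p^{(i)}(dy \mid z, a, \theta, w)$ implies $g_n^{(i)}(\cdot,\cdot) \to g^{(i)}(\cdot,\cdot)$ pointwise. We prove that the convergence is indeed uniform. It is enough to prove that this sequence of functions is equicontinuous. Then along with pointwise convergence it will imply uniform convergence on compacts [22, p. 168, Ex: 16]. This is also a place where (A1) is used.

Define $g' : S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \to \mathbb{R}$ by $g'(z', a', \theta', w') = \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z', a', \theta', w')$. Then $g'$ is continuous. Let $A = S^{(i)} \times U^{(i)} \times (\{\theta_n\} \cup \theta) \times (\{w_n\} \cup w)$. So, $A$ is compact and $g'|_A$ is uniformly continuous. This implies that for all $\epsilon > 0$, there exists $\delta > 0$ such that if $\rho'(s_1, s_2) < \delta$, $\mu'(a_1, a_2) < \delta$, $\|\theta_1 - \theta_2\| < \delta$, $\|w_1 - w_2\| < \delta$, then $|g'(s_1, a_1, \theta_1, w_1) - g'(s_2, a_2, \theta_2, w_2)| < \epsilon$ where $s_1, s_2 \in S^{(i)}$, $a_1, a_2 \in U^{(i)}$, $\theta_1, \theta_2 \in (\{\theta_n\} \cup \theta)$, $w_1, w_2 \in (\{w_n\} \cup w)$ and $\rho'$ and $\mu'$ denote the metrics in $S^{(i)}$ and $U^{(i)}$ respectively. Now use this same $\delta$ for the $\{g_n^{(i)}(\cdot,\cdot)\}$ to get for all $n$ the following for $\rho'(z_1, z_2) < \delta$, $\mu'(a_1, a_2) < \delta$:

$$|g_n^{(i)}(z_1, a_1) - g_n^{(i)}(z_2, a_2)| = |g'(z_1, a_1, \theta_n, w_n) - g'(z_2, a_2, \theta_n, w_n)| < \epsilon.$$

Hence $\{g_n^{(i)}(\cdot,\cdot)\}$ is equicontinuous. For large $n$, $\sup_{(z,a) \in S^{(i)} \times U^{(i)}} |g_n^{(i)}(z, a) - g^{(i)}(z, a)| < \epsilon/2$ because of uniform convergence of $\{g_n^{(i)}(\cdot,\cdot)\}$, hence $\int_{S^{(i)} \times U^{(i)}} |g_n^{(i)}(z, a) - g^{(i)}(z, a)|\,\Psi_n^{(i)}(dz, da) < \epsilon/2$. Now (for $n$ large),

$$\left| \int_{S^{(i)} \times U^{(i)}} g_n^{(i)}(z, a)\,\Psi_n^{(i)}(dz, da) - \int_{S^{(i)} \times U^{(i)}} g^{(i)}(z, a)\,\Psi^{(i)}(dz, da) \right|$$

$$= \left| \int_{S^{(i)} \times U^{(i)}} [g_n^{(i)}(z, a) - g^{(i)}(z, a)]\,\Psi_n^{(i)}(dz, da) + \int_{S^{(i)} \times U^{(i)}} g^{(i)}(z, a)\,\Psi_n^{(i)}(dz, da) - \int_{S^{(i)} \times U^{(i)}} g^{(i)}(z, a)\,\Psi^{(i)}(dz, da) \right|$$

$$< \epsilon/2 + \left| \int_{S^{(i)} \times U^{(i)}} g^{(i)}(z, a)\,\Psi_n^{(i)}(dz, da) - \int_{S^{(i)} \times U^{(i)}} g^{(i)}(z, a)\,\Psi^{(i)}(dz, da) \right|$$

$$< \epsilon. \tag{7}$$

The last inequality follows from the fact that $\Psi_n^{(i)} \Rightarrow \Psi^{(i)}$. Hence from (7) we get

$$\int_{S^{(i)}} f^{(i)}(z)\,\Psi^{(i)}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\,p^{(i)}(dy \mid z, a, \theta, w)\,\Psi^{(i)}(dz, da),$$

proving that the map is upper-semi-continuous.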

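Define $\tilde{g}(\theta, w, \nu) = \int g(\theta, w, z)\,\nu(dz, U^{(2)})$ for $\nu \in \mathcal{P}(S^{(2)} \times U^{(2)})$ and $\hat{g}(\theta, w) = \{\tilde{g}(\theta, w, \nu) : \nu \in D^{(2)}(\theta, w)\}$.

Lemma 2.5. $\hat{g}(\cdot,\cdot)$ is a Marchaud map.

Proof 4. (i) Convexity and compactness follow trivially from the same for the map $(\theta, w) \to D^{(2)}(\theta, w)$.

(ii)

$$\|\tilde{g}(\theta, w, \nu)\| = \left\| \int g(\theta, w, z)\,\nu(dz, U^{(2)}) \right\|$$

$$\leq \int \|g(\theta, w, z)\|\,\nu(dz, U^{(2)})$$

$$\leq \int \left[ L^{(2)}(\|\theta\| + \|w\|) + \|g(0, 0, z)\| \right] \nu(dz, U^{(2)})$$

$$\leq \max\left(L^{(2)}, \int \|g(0, 0, z)\|\,\nu(dz, U^{(2)})\right)(1 + \|\theta\| + \|w\|).$$

Clearly, $K(\theta) = \max(L^{(2)}, \int \|g(0, 0, z)\|\,\nu(dz, U^{(2)})) > 0$. The above is true for all $\tilde{g}(\theta, w, \nu) \in \hat{g}(\theta, w)$, $\nu \in D^{(2)}(\theta, w)$.

(iii) Let $(\theta_n, w_n) \to (\theta, w)$, $\tilde{g}(\theta_n, w_n, \nu_n) \to m$, $\nu_n \in D^{(2)}(\theta_n, w_n)$. Now, $\{\nu_n\}$ is tight, hence has a convergent sub-sequence $\{\nu_{n_k}\}$ with $\nu$ being the limit. Then using arguments similar to the proof of Lemma 2.4 one can show that $m = \tilde{g}(\theta, w, \nu)$, whereas $\nu \in D^{(2)}(\theta, w)$ follows directly from the upper-semi-continuity of the map $(\theta, w) \to D^{(2)}(\theta, w)$ for all $\theta$.

Note that the map $\hat{h}(\cdot,\cdot)$ can be defined similarly and can be shown to be a Marchaud map using the exact same technique.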


2.3 Other assumptions needed for two time-scale convergence analysis 

We now list the other assumptions required for two time-scale convergence analysis: 

(A6) For all $\theta \in \mathbb{R}^d$, the differential inclusion

$$\dot{w}(t) \in \hat{g}(\theta, w(t)) \tag{8}$$

has a singleton global attractor $\lambda(\theta)$ where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Additionally, there exists a continuous function $V : \mathbb{R}^{d+k} \to [0, \infty)$ satisfying the hypothesis of Corollary 3.28 of [5] with $A = \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$. This is the most important assumption as it links the fast and slow iterates.

(A7) Stability of the iterates: $\sup_n (\|\theta_n\| + \|w_n\|) < \infty$ a.s.

Let $\bar{\theta}(t), t \geq 0$ be the continuous, piecewise linear trajectory defined by $\bar{\theta}(t(n)) = \theta_n, n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,

$$\bar{\theta}(t) = \theta_n + (\theta_{n+1} - \theta_n)\,\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$
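The interpolated trajectory is straightforward to compute from the iterates. The sketch below assumes the time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$ used in Section 3, a scalar iterate, and hypothetical helper names; it evaluates the piecewise linear interpolation at an arbitrary time inside the covered range.

```python
from bisect import bisect_right


def time_instants(step, n_max):
    """Return [t(0), ..., t(n_max)] with t(0) = 0 and t(n) = sum_{m<n} step(m)."""
    t = [0.0]
    for m in range(n_max):
        t.append(t[-1] + step(m))
    return t


def interpolate(t_query, t, theta):
    """Piecewise linear interpolation of theta[n] placed at times t[n]."""
    # Index n with t[n] <= t_query < t[n+1], clipped to the available range.
    n = bisect_right(t, t_query) - 1
    n = min(max(n, 0), len(t) - 2)
    frac = (t_query - t[n]) / (t[n + 1] - t[n])
    return theta[n] + (theta[n + 1] - theta[n]) * frac


t = time_instants(lambda m: 1.0 / (m + 1), 4)  # step sizes 1, 1/2, 1/3, 1/4
theta = [0.0, 2.0, 1.0, 3.0, 2.5]              # hypothetical iterates
print(interpolate(0.5, t, theta), interpolate(1.25, t, theta))
```

Because the step sizes shrink, equal stretches of interpolated time cover more and more iterations as $n$ grows, which is what lets the interpolated trajectory be compared with o.d.e. or d.i. solutions on fixed time windows.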

The following theorem is our main result: 

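Theorem 2.6 (Slower timescale result). Under assumptions (A1)-(A7),

$$(\theta_n, w_n) \to \cup_{\theta^* \in A_0} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty,$$

where $A_0 = \bigcap_{t \geq 0} \overline{\{\bar{\theta}(s) : s \geq t\}}$ is almost everywhere an internally chain transitive set of the differential inclusion

$$\dot{\theta}(t) \in \hat{h}(\theta(t)), \tag{9}$$

where $\hat{h}(\theta) = \{\tilde{h}(\theta, \lambda(\theta), \nu) : \nu \in D^{(1)}(\theta, \lambda(\theta))\}$. We call (8) and (9) the faster and slower d.i., to correspond with the faster and slower recursions, respectively.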
Corollary 1. Under the additional assumption that the inclusion

$$\dot{\theta}(t) \in \hat{h}(\theta(t))$$

has a global attractor set $A_1$,

$$(\theta_n, w_n) \to \cup_{\theta^* \in A_1} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$

Remark 5. In the case where the set $D^{(2)}(\theta, w)$ is a singleton, we can relax (A6) to local attractors also. The relaxed assumption will be

(A6)’ The function $\hat{g}(\theta, w) = \int g(\theta, w, z)\,\Gamma^{(2)}_{\theta,w}(dz)$ is Lipschitz continuous, where $\Gamma^{(2)}_{\theta,w}$ is the only element of $D^{(2)}(\theta, w)$. Further, for all $\theta \in \mathbb{R}^d$, the o.d.e.

$$\dot{w}(t) = \hat{g}(\theta, w(t)) \tag{10}$$

has an asymptotically stable equilibrium $\lambda(\theta)$ with domain of attraction $G_\theta$, where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Also, assume that $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ is non-empty. Moreover, the function $V' : G \to [0, \infty)$ defined by $V'(\theta, w) = V_\theta(w)$ is continuously differentiable, where $V_\theta(\cdot)$ is the Lyapunov function (for definition see [21, Chapter 11.2.3]) for the o.d.e. (10) with $\lambda(\theta)$ as its attractor, and $G = \{(\theta, w) : w \in G_\theta, \theta \in \mathbb{R}^d\}$. This extra condition is needed so that the set graph($\lambda$) $:= \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ becomes an asymptotically stable set of the coupled o.d.e.

$$\dot{w}(t) = \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0.$$

Note that (A6)’ allows multiple attractors (at least one of them has to be a point, others can be sets) for the faster o.d.e. for every $\theta$.

Then the statement of Theorem 2.6 will be modified as in the following:

Theorem 2.7 (Slower timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1)-(A5), (A6)’ and (A7), on the event “$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually”,

$$(\theta_n, w_n) \to \cup_{\theta^* \in A_0} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$

The requirement on $\{w_n\}$ is much stronger than the usual local attractor statement for Kushner-Clarke lemma
na Section II. C] which requires the iterates to enter a compact set in the domain for attraction of the local attractor 
infinitely often only. The reason for imposing this strong assumption is that graph($\lambda$) is not a subset of any compact set in $\mathbb{R}^{d+k}$ and hence the usual tracking lemma kind of arguments do not go through directly. One has to relate the limit set of the coupled iterate $(\theta_n, w_n)$ to graph($\lambda$) (see the proof of Lemma 3.5).

We present the proof of our main results in the next section. 


3 Main Results 

We first discuss an extension of the single time-scale controlled Markov noise framework of [20] under our assumptions to prove our main results. Note that the results of [20] assume that the state space of the controlled Markov process is Polish, which may impose additional conditions that are hard to verify. In this section, other than proving our two time-scale results, we prove many of the results in [20] (which were only stated there) assuming the state space to be compact.

We begin by describing the intuition behind the proof techniques in [20].

The space $C([0, \infty); \mathbb{R}^d)$ of continuous functions from $[0, \infty)$ to $\mathbb{R}^d$ is topologized with the coarsest topology such that the map that takes any $f \in C([0, \infty); \mathbb{R}^d)$ to its restriction to $[0, T]$, when viewed as an element of the space $C([0, T]; \mathbb{R}^d)$, is continuous for all $T > 0$. In other words, $f_n \to f$ in this space iff $f_n|_{[0,T]} \to f|_{[0,T]}$ for all $T > 0$. The other notations used below are the same as those in [20]. We present a few for easy reference.

Consider the single time-scale stochastic approximation recursion with controlled Markov noise:

$$x_{n+1} = x_n + a(n)\left[h(x_n, Y_n) + M_{n+1}\right]. \tag{11}$$

Define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$. Let $\bar{x}(t), t \geq 0$ be the continuous, piecewise linear trajectory defined by $\bar{x}(t(n)) = x_n, n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,

$$\bar{x}(t) = x_n + (x_{n+1} - x_n)\,\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$

Now, define $\tilde{h}(x, \nu) = \int h(x, z)\,\nu(dz, U)$ for $\nu \in \mathcal{P}(S \times U)$. Let $\mu(t), t \geq 0$ be the random process defined by $\mu(t) = \delta_{(Y_n, Z_n)}$ for $t \in [t(n), t(n+1)), n \geq 0$, where $\delta_{(y,a)}$ is the Dirac measure corresponding to $(y, a)$. Consider the non-autonomous o.d.e.

$$\dot{x}(t) = \tilde{h}(x(t), \mu(t)). \tag{12}$$
Let $x^s(t), t \geq s$, denote the solution to (12) with $x^s(s) = \bar{x}(s)$, for $s \geq 0$. Note that $x^s(t), t \in [s, s+T]$ and $x^s(t), t \geq s$ can be viewed as elements of $C([0, T]; \mathbb{R}^d)$ and $C([0, \infty); \mathbb{R}^d)$ respectively. With this abuse of notation, it is easy to see that $\{x^s(\cdot)|_{[s,s+T]}, s \geq 0\}$ is a pointwise bounded and equicontinuous family of functions in $C([0, T]; \mathbb{R}^d)$ $\forall T > 0$. By the Arzela-Ascoli theorem, it is relatively compact. From Lemma 2.2 of [20] one can see that for all $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot)|_{[s(n), s(n)+T]}, n \geq 1\}$ has a limit point in $C([0, T]; \mathbb{R}^d)$ $\forall T > 0$. With the above topology for $C([0, \infty); \mathbb{R}^d)$, $\{x^s(\cdot), s \geq 0\}$ is also relatively compact in $C([0, \infty); \mathbb{R}^d)$ and for all $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot), n \geq 1\}$ has a limit point in $C([0, \infty); \mathbb{R}^d)$.

One can write from (11) the following:

$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t h(\bar{x}(u(n) + \tau), \nu(u(n) + \tau))\,d\tau + W^n(t),$$

where $u(n) \uparrow \infty$, $\bar{x}(u(n) + \cdot) \to \tilde{x}(\cdot)$, $\nu(t) = (Y_n, Z_n)$ for $t \in [t(n), t(n+1)), n \geq 0$ and $W^n(t) = W(t + u(n)) - W(u(n))$, $W(t) = W_n + (W_{n+1} - W_n)\frac{t - t(n)}{t(n+1) - t(n)}$ for $t \in [t(n), t(n+1))$, $W_n = \sum_{k=0}^{n-1} a(k) M_{k+1}, n \geq 0$. From here one cannot directly take the limit on both sides, as finding limit points of $\nu(s + \cdot)$ as $s \to \infty$ is not meaningful. Now, $h(x, y) = \int h(x, z)\,\delta_{(y,a)}(dz \times U)$. Hence by defining $\tilde{h}(x, \rho) = \int h(x, z)\,\rho(dz, U)$ and $\mu(t) = \delta_{\nu(t)}$ one can write the above as

$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t \tilde{h}(\bar{x}(u(n) + \tau), \mu(u(n) + \tau))\,d\tau + W^n(t). \tag{13}$$

The advantage is that the space $\mathcal{U}$ of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$ is compact metrizable, so sub-sequential limits exist. Note that $\mu(\cdot)$ is not a member of $\mathcal{U}$; rather, we need to fix a sample point, i.e., $\mu(\cdot, \omega) \in \mathcal{U}$. For ease of understanding, we abuse the terminology and talk about the limit points $\tilde{\mu}(\cdot)$ of $\mu(s + \cdot)$.

From [20] one can infer that the limit $\tilde{x}(\cdot)$ of $\bar{x}(u(n) + \cdot)$ satisfies the o.d.e. $\dot{x}(t) = \tilde{h}(x(t), \mu(t))$ with $\mu(\cdot)$ replaced by $\tilde{\mu}(\cdot)$. Here each $\tilde{\mu}(t), t \in \mathbb{R}$ in $\tilde{\mu}(\cdot)$ is generated through different limiting processes, each one associated with the compact metrizable space $\mathcal{U}_t$ = space of measurable functions from $[0, t]$ to $\mathcal{P}(S \times U)$. This will be problematic if we want to further explore the process $\tilde{\mu}(\cdot)$ and convert the non-autonomous o.d.e. into an autonomous one.

Hence the main result is proved using an auxiliary lemma [20, Lemma 2.3] other than the tracking lemma (Lemma 2.2 of [20]). Let $u(n(k)) \uparrow \infty$ be such that $\bar{x}(u(n(k)) + \cdot) \to \tilde{x}(\cdot)$ and $\mu(u(n(k)) + \cdot) \to \tilde{\mu}(\cdot)$; then using Lemma 2.2 of [20] one can show that $x^{u(n(k))}(\cdot) \to \tilde{x}(\cdot)$. Then the auxiliary lemma shows that the o.d.e. trajectory $x^{u(n(k))}(\cdot)$ associated with $\mu(u(n(k)) + \cdot)$ tracks (in the limit) the o.d.e. trajectory associated with $\tilde{\mu}(\cdot)$. Hence Lemma 2.3 of [20] links the two limiting processes $\tilde{x}(\cdot)$ and $\tilde{\mu}(\cdot)$ in some sense. Note that Lemma 2.3 of [20] involves only the o.d.e. trajectories, not the interpolated trajectory of the algorithm.

Consider the iteration

$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, Y_n) + \epsilon_n + M_{n+1}\right], \tag{14}$$

where $\epsilon_n \to 0$ and the rest of the notations are the same as in [20]. Specifically, $\{Y_n\}$ is the controlled Markov process driven by $\{\theta_n\}$ and $M_{n+1}, n \geq 0$ is a martingale difference sequence. Let $\bar{\theta}(t), t \geq 0$ be the continuous, piecewise linear trajectory of (14) defined by $\bar{\theta}(t(n)) = \theta_n, n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$. Also, let $\theta^s(t), t \geq s$, denote the solution to (12) with $\theta^s(s) = \bar{\theta}(s)$, for $s \geq 0$.

The convergence analysis of (14) requires some changes in Lemma 2.2 and 3.1 of [20]. The modified versions of them are precisely the following two lemmas.

Lemma 3.1. For any $T > 0$, $\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0$, a.s. as $s \to \infty$.

Proof 5. The proof follows from Lemma 2.2 and Remark 3 thereof (p. 144) of [20].

Now, $\mu$ can be viewed as a random variable taking values in $\mathcal{U}$ = the space of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$. This space is topologized with the coarsest topology such that the map

$$\nu(\cdot) \in \mathcal{U} \to \int_0^T g(t) \int f \, d\nu(t)\,dt \in \mathbb{R}$$

is continuous for all $f \in C(S)$, $T > 0$, $g \in L_2[0, T]$. Note that $\mathcal{U}$ is compact metrizable.

Lemma 3.2. Almost surely every limit point of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ where $\tilde{\mu}(\cdot)$ satisfies $\tilde{\mu}(t) \in D(\tilde{\theta}(t))$ a.e. $t$.

Proof 6. Suppose that u{n) f oo, p{u{n) -(-.)—;► p{-) and 9(u(n) -(-.)—>■ 9{-). Let {fi} be countable dense in the unit 
ball of C{S), hence a separating class, i.e., Vi,f fidp = f fidv implies p = v. For each i, 


n — 1 

Cn=Yl aim){MYm+i) 

m—1 


J My)p{dy\Ym 
8 



is a zero-mean martingale with Tn = a{Om,Ym, Zm,rn < n). Moreover, it is a square integrable martingale due to 
the fact that fi’s are bounded and each is a finite sum. Its quadratic variation process 

71 — 1 

An=^ a{mf E[{fi{Ym+i) - 

m—0 

is almost surely convergent. By the martingale convergence theorem, ^ 0 converges a.s. for all i. As before let 
T(n, t) = min{m > n : t{m) > t(n) + t} for t > 0,n > 0. Then as n ^ oo, 

T{n,t) 

a(m)(/*(y„+i) - / My)pidy\Ym : ^mi ^TTi)) ^ d.S. 

m—n 

fort > 0. By our choice of {fi} and the fact that {a(n)} is an eventually non-increasing sequence (the latter property 
is used only here and in Lemma roi) . we have 

r(n,i) 

(a(m) - a(m + l))fi{Ym+i) 0, a.s. 


J My)pidy\Ym,z^,em)f\Tm] + E[iCoy] 


From the foregoing. 


r(7i,t) 


y] (a(m + l)/i(Fm+i) - a(m) My)p{dy\Yjn, Zjn,d7n)) ^ 0, a.s. 

m=n 

for all t > 0, which implies 

T(n,t) 

a{m){fi{Ym) - / My)p{dy\Ym,Zm,dm)) 0, a.s. 


for all t > 0 due to the fact that a{n) —> 0 and fi{.) are bounded. This implies 


lt{n) 


{ ifiiz)- / fiiy)pidy\z,a,e{s)))y{s,dzda))ds ^ 0, a.s. 


and that in turn implies 


ru{n)+t r r 

/ { iMz)- / f^{y)p{dy\z,a,9{s)))fi{s,dzda))ds ^ 0, a.s. 

Ju{n) J J 


(this is true because ain) 0 and fi{-) is bounded) where 0{s) = On when s € [t(n),t{n + 1)) for n > 0. Now, one 
can claim from the above that 


i(n)+t 


!i(n) 


i (MV - / My)pidy\z,a,0{s)))fj,{s,dzda))ds ^ 0, a.s. 


This is due to the fact that the map S x U x B {z,a,6) ^ J f{y)p{dy\z,a,9) is continuous and hence uniformly 
continuous on the compact set A = S x U x M where M is the compact set s.t. 9n & M for all n. Here we also use 
the fact that ||0(s) — 0^,11 = \\h{9m,Ym) Cm + Mm+i\\is — Sm) —>■ 0, s G [tm,tm-i-i) as the first two terms inside the 
norm in the R.H.S are bounded. The above convergence is equivalent to 

J j My)pidy\z,a,d{s + u{n)))p{s-\-u{n),dzda))ds ^0, a.s. 

Fix a sample point in the probability one set on which the convergence above holds for all i. Then the convergence 
above leads to 


J ij Mz)- J fi{y)p{dy\z,a,9{s)))p,{s,dzda)ds = 0yi. 

Here we use one part of the proof from Lemma 2.3 of ]2(ff that if p^{-) —>■ G U then for any t > 0, 


( 15 ) 


f{s,z,a)yF{s,dzda)ds— / / f{s, z,a)ix°°{s,dzda)ds ^ 0, 
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for all f € X S X A) and the fact that fn{s,z,a) = J fi{y)p{dy\z, a, 9{s + u{n))) converges uniformly to 

f{s,z,a) = f fi(y)p(dylz, a, 9(s)). To prove the latter, define g : x [0, i] x S' x A —>■ K. 6j/ g{9{-),s,z,a) = 

f fi(y)p(dylz, a, 9(s))). To see that g is continuous we need to check that if 9ni-) —>■ 9(-) uniformly and s{n) —>■ s, 
then 9n{s(n)) —>■ 9{s). This is because \\9n{s(n)) — 0(s)|| = ||0„(s(n)) — 9{s{n)) + 9{s(n)) — 0(s)|| < |10„(s(n)) — 
6>(s(n))|| + ||0(s(n)) — 0(s)||. The first and second terms go to zero due to the uniform convergence o/0„(-),n > 0 and 
continuity of 9(■) respectively. Let A — {9{u{n) + •) I[«(«),u(n)+t] j > 1} U 0(-)l[o,t] ■ ^ *5 compact as it is the union of 
a sequence of functions and their limit. So, 5|(Ax[o,t]xSxf7) is uniformly continuous. Then using the same arguments 
as in Lemma lK^ we can show equicontinuity o/{/„(.,.)}, that results in uniform convergence and thereby Iil5\) . An 
application of Lebesgue ’s theorem in conjunction with shows that 

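$$\int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\tilde\theta(t)) \Big)\, \tilde\mu(t, dz\,da) = 0 \quad \forall i$$

for a.e. $t$. By our choice of $\{f_i\}$, this leads to

$$\tilde\mu(t, dy \times U) = \int p(dy \mid z,a,\tilde\theta(t))\, \tilde\mu(t, dz\,da)$$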

a.e. t. Therefore the conclusion follows by disintegrating such measure as the product of marginal on S and the 
regular conditional law on U fJ2(A p IjO]). 

Remark 6. Note that the above invariant distribution does not come “naturally”; rather it arises from the assumption 
made to match the natural timescale intuition for the controlled Markov noise component, i.e., the slower iterate 
should see the average effect of the Markov component. 

The proof of the following lemma is unchanged from its original version in this case, so we state it only for completeness and refer the reader to Lemma 2.3 of [20] for its proof.

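Lemma 3.3. Let $\mu^n(\cdot) \to \mu^\infty(\cdot) \in \mathcal{U}$. Let $\theta^n(\cdot), n = 1, 2, \dots, \infty$ denote solutions to (14) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu^n(\cdot)$, for $n = 1, 2, \dots, \infty$. Suppose $\theta^n(0) \to \theta^\infty(0)$. Then

$$\lim_{n \to \infty} \sup_{t \in [0,T]} \|\theta^n(t) - \theta^\infty(t)\| = 0$$

for every $T > 0$.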

Lemma 3.4. Almost surely, $\{\theta_n\}$ converges to an internally chain transitive set of the differential inclusion

$$\dot\theta(t) \in \hat h(\theta(t)), \tag{16}$$

where $\hat h(\theta) = \{\tilde h(\theta,\nu) : \nu \in D(\theta)\}$.

Proof 7. Lemma 3.3 shows that every limit point $(\tilde\mu(\cdot), \tilde\theta(\cdot))$ of $(\mu(s + \cdot), \bar\theta(s + \cdot))$ as $s \to \infty$ is such that $\tilde\theta(\cdot)$ satisfies (14) with $\mu(\cdot) = \tilde\mu(\cdot)$. Hence, $\tilde\theta(\cdot)$ is absolutely continuous. Moreover, using Lemma 3.2 one can see that it satisfies (16) for a.e. $t$, hence it is a solution to the differential inclusion (16). Hence the proof follows.

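Lemma 3.5 (Faster timescale result). $(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ a.s.

Proof 8. We first rewrite the slower recursion as

$$\theta_{n+1} = \theta_n + b(n)\big[\epsilon_n + M^{(3)}_{n+1}\big],$$

where $\epsilon_n = \frac{a(n)}{b(n)} h(\theta_n, w_n, Z^{(1)}_n) \to 0$ as $n \to \infty$ a.s. and $M^{(3)}_{n+1} = \frac{a(n)}{b(n)} M^{(1)}_{n+1}$ for $n \ge 0$. Let $\alpha_n = (\theta_n, w_n)$, $\alpha = (\theta, w) \in \mathbb{R}^{d+k}$, $G(\alpha, z) = (0, g(\alpha, z))$, $\epsilon'_n = (\epsilon_n, 0)$ and $M^{(4)}_{n+1} = (M^{(3)}_{n+1}, M^{(2)}_{n+1})$. Then one can write the two recursions in the framework of [20] as

$$\alpha_{n+1} = \alpha_n + b(n)\big[G(\alpha_n, Z^{(2)}_n) + \epsilon'_n + M^{(4)}_{n+1}\big], \tag{17}$$

with $\epsilon'_n \to 0$ as $n \to \infty$. Hence $\alpha_n, n \ge 0$ converges almost surely to an internally chain transitive set of the differential inclusion

$$\dot\alpha(t) \in \hat G(\alpha(t)),$$

where $\hat G(\alpha) = \{\tilde G(\alpha,\nu) : \nu \in D^{(2)}(\theta,w)\}$ with $\tilde G(\alpha,\nu) = (0, \tilde g(\theta,w,\nu))$. In other words, $(\theta_n, w_n), n \ge 0$ converges to an internally chain transitive set of the differential inclusion

$$\dot w(t) \in \hat g(\theta(t), w(t)), \quad \dot\theta(t) = 0.$$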

The rest follows from the second part of (A6).

Remark 7. Under the conditions mentioned in Remark 4, the above faster timescale result should be modified as follows:

Lemma 3.6 (Faster timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1) - (A5), (A6)' and (A7), on the event “$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually”,

$$(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\} \quad \text{a.s.}$$

Proof 9. Fix a sample point $\omega$. The proof follows from these observations:

1. continuity of the flow for the coupled o.d.e. around the initial point,

2. $\sup_n \|\theta_n\| = M_1 < \infty$,

3. the fact that the set graph($\lambda$) is Lyapunov stable ($V'(\cdot)$ as mentioned in (A6)' will be a Lyapunov function for this set), and

4. the fact that $\bigcap_{t \ge 0} \overline{\{\bar\alpha(s) : s \ge t\}}$ is an internally chain transitive set of the coupled o.d.e.

$$\dot w(t) = g(\theta(t), w(t)), \quad \dot\theta(t) = 0, \tag{18}$$

where $\bar\alpha(\cdot)$ is the interpolated trajectory of the coupled iterate $\{\alpha_n\}$.

As $\{\theta : \|\theta\| \le M_1\} \times B \subset \bigcup_{\theta \in \mathbb{R}^d} \{\{\theta\} \times G_\theta\}$, the first three observations show that for all $\epsilon > 0$, there exists a $T_\epsilon > 0$ such that any o.d.e. trajectory for (18) with starting point in the compact set $\{\theta : \|\theta\| \le M_1\} \times B$ reaches the $\epsilon$-neighbourhood of graph($\lambda$) after time $T_\epsilon$. Further,

$$\bigcap_{t \ge 0} \overline{\{\bar\alpha(s) : s \ge t\}} \subset \{\theta : \|\theta\| \le M_1\} \times B.$$

Then one can use the last observation by choosing $T > T_\epsilon$ to show the required convergence to the set graph($\lambda$).

Remark 8. One interesting question in this context is whether one can extend the single timescale local attractor convergence statements to the two timescale setting under some verifiable conditions. More specifically, if there is a global attractor $A_1$ for

$$\dot\theta(t) \in \hat h(\theta(t)),$$

then can one provide verifiable conditions to show

$$P\Big[(\theta_n, w_n) \to \bigcup_{\theta \in A_1} (\theta, \lambda(\theta))\Big] > 0?$$

Here $\lambda(\theta)$ is a local attractor as mentioned in (A6)'.

There are two ways in which this could possibly be tried: 

1. Use Theorem 2.1, where we show that on the event “$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually”,

$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A_1} (\theta^*, \lambda(\theta^*)) \quad \text{a.s. as } n \to \infty,$$

which is an extension of the Kushner-Clarke Lemma to the two timescale case. Therefore the task would be to impose verifiable assumptions so that $P(\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ “eventually”$) > 0$. In a stochastic approximation scenario it is not immediately clear how one could impose verifiable assumptions so that such a probabilistic statement becomes true.

2. The second approach would be to extend the analysis of to the two timescale case. In our opinion this is 

very hard as this analysis is based on the attractor introduced by Benaim et al. whereas the coupled o. d.e 03) 
which tracks the coupled iterate (0n,w„) (therefore the interpolated trajectory of the coupled iterate will be an 
asymptotic pseudo-trajectory m for 118\) ) has no attractor. The reason is that one cannot obtain a fundamental 
neighbourhood for sets like Ug^Aiid, X{9)) as the 9 component will remain constant for any trajectory of the 
above coupled o.d.e. 

Thus it is not immediately clear how this question can be addressed, and this will be an interesting future direction.

From the faster timescale results we get $\|w_n - \lambda(\theta_n)\| \to 0$ a.s., i.e., $\{w_n\}$ asymptotically tracks $\{\lambda(\theta_n)\}$ a.s. Now, consider the non-autonomous o.d.e.

$$\dot\theta(t) = \tilde h(\theta(t), \lambda(\theta(t)), \mu(t)), \tag{19}$$

where $\mu(t) = \delta_{(Z^{(1)}_n, A^{(1)}_n)}$ when $t \in [t(n), t(n+1))$ for $n \ge 0$ and $\tilde h(\theta, w, \nu) = \int h(\theta, w, z)\, \nu(dz)$. Let $\theta^s(t), t \ge s$ denote the solution to (19) with $\theta^s(s) = \bar\theta(s)$, for $s \ge 0$. Then

Lemma 3.7. For any $T > 0$, $\sup_{t \in [s, s+T]} \|\bar\theta(t) - \theta^s(t)\| \to 0$ a.s. as $s \to \infty$.

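Proof 10. The slower recursion corresponds to

$$\theta_{n+1} = \theta_n + a(n)\big[h(\theta_n, w_n, Z^{(1)}_n) + M^{(1)}_{n+1}\big].$$

Let $t(n+m) \in [t(n), t(n) + T]$. Let $[t] = \max\{t(k) : t(k) \le t\}$. Then by construction,

$$\begin{aligned}
\bar\theta(t(n+m)) &= \bar\theta(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar\theta(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) + \delta_{n,n+m} \\
&= \bar\theta(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar\theta(t(n+k)), \lambda(\bar\theta(t(n+k))), Z^{(1)}_{n+k}) \\
&\quad + \sum_{k=0}^{m-1} a(n+k) \big( h(\bar\theta(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) - h(\bar\theta(t(n+k)), \lambda(\theta_{n+k}), Z^{(1)}_{n+k}) \big) \\
&\quad + \delta_{n,n+m},
\end{aligned}$$

where $\delta_{n,n+m} = \zeta_{n+m} - \zeta_n$ with $\zeta_n = \sum_{i=0}^{n-1} a(i) M^{(1)}_{i+1}$, $n \ge 1$. Similarly,

$$\begin{aligned}
\theta^{t(n)}(t(m+n)) &= \bar\theta(t(n)) + \int_{t(n)}^{t(n+m)} \tilde h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t))\, dt \\
&= \bar\theta(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\theta^{t(n)}(t(n+k)), \lambda(\theta^{t(n)}(t(n+k))), Z^{(1)}_{n+k}) \\
&\quad + \int_{t(n)}^{t(n+m)} \big( \tilde h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - \tilde h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu(t)) \big)\, dt.
\end{aligned}$$

Let $t(n) \le t \le t(n+m)$. Now, if $0 \le k \le (m-1)$ and $t \in (t(n+k), t(n+k+1)]$,

$$\begin{aligned}
\|\theta^{t(n)}(t)\| &\le \|\bar\theta(t(n))\| + \Big\| \int_{t(n)}^{t} \tilde h(\theta^{t(n)}(\tau), \lambda(\theta^{t(n)}(\tau)), \mu(\tau))\, d\tau \Big\| \\
&\le \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \big( \|h(0,0,Z^{(1)}_{n+l})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|) \big)\, d\tau \\
&\quad + \int_{t(n+k)}^{t} \big( \|h(0,0,Z^{(1)}_{n+k})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|) \big)\, d\tau \\
&\le C_0 + (M + L^{(1)}\|\lambda(0)\|)T + L^{(1)}(K+1) \int_{t(n)}^{t} \|\theta^{t(n)}(\tau)\|\, d\tau,
\end{aligned}$$

where $C_0 = \sup_n \|\theta_n\| < \infty$ and $\sup_{z \in S^{(1)}} \|h(0,0,z)\| = M$. By Gronwall's inequality, it follows that

$$\|\theta^{t(n)}(t)\| \le \big(C_0 + (M + L^{(1)}\|\lambda(0)\|)T\big)\, e^{L^{(1)}(K+1)T}.$$

Next,

$$\begin{aligned}
\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| &\le \int_{t(n+k)}^{t} \|h(\theta^{t(n)}(s), \lambda(\theta^{t(n)}(s)), Z^{(1)}_{n+k})\|\, ds \\
&\le \big(\|h(0,0,Z^{(1)}_{n+k})\| + L^{(1)}\|\lambda(0)\|\big)(t - t(n+k)) + L^{(1)}(K+1) \int_{t(n+k)}^{t} \|\theta^{t(n)}(s)\|\, ds \\
&\le C_T\, a(n+k),
\end{aligned}$$

where $C_T = (M + L^{(1)}\|\lambda(0)\|) + L^{(1)}(K+1)\big(C_0 + (M + L^{(1)}\|\lambda(0)\|)T\big) e^{L^{(1)}(K+1)T}$. Thus,

$$\begin{aligned}
&\Big\| \int_{t(n)}^{t(n+m)} \big( \tilde h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - \tilde h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu(t)) \big)\, dt \Big\| \\
&\le \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), Z^{(1)}_{n+k}) - h(\theta^{t(n)}(t(n+k)), \lambda(\theta^{t(n)}(t(n+k))), Z^{(1)}_{n+k})\|\, dt \\
&\le L \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\|\, dt \\
&\le C_T L \sum_{k=0}^{m-1} a(n+k)^2 \\
&\le C_T L \sum_{k=0}^{\infty} a(n+k)^2 \to 0 \text{ as } n \to \infty, \text{ where } L = L^{(1)}(K+1).
\end{aligned}$$

Hence

$$\begin{aligned}
\|\bar\theta(t(n+m)) - \theta^{t(n)}(t(n+m))\| &\le L \sum_{k=0}^{m-1} a(n+k)\, \|\bar\theta(t(n+k)) - \theta^{t(n)}(t(n+k))\| \\
&\quad + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \ge 0} \|\delta_{n,n+k}\| \\
&\quad + L^{(1)} \sum_{k=0}^{m-1} a(n+k)\, \|w_{n+k} - \lambda(\theta_{n+k})\| \\
&\le L \sum_{k=0}^{m-1} a(n+k)\, \|\bar\theta(t(n+k)) - \theta^{t(n)}(t(n+k))\| \\
&\quad + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \ge 0} \|\delta_{n,n+k}\| \\
&\quad + L^{(1)} T \sup_{k \ge 0} \|w_{n+k} - \lambda(\theta_{n+k})\|, \quad \text{a.s.}
\end{aligned}$$

Define

$$K_{T,n} = C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \ge 0} \|\delta_{n,n+k}\| + L^{(1)} T \sup_{k \ge 0} \|w_{n+k} - \lambda(\theta_{n+k})\|.$$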

Note that $K_{T,n} \to 0$ a.s. as $n \to \infty$. The remainder of the proof follows in exactly the same manner as the tracking lemma, see
Lemma 1, Chapter 2 of In- 
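
For reference, the Gronwall step used in the proof above can be written out as follows. This is the standard integral form of the inequality, stated with generic symbols $u$, $C$, $\kappa$ that are not part of the paper's notation; the last line indicates how the quantities of the proof are assumed to map onto them.

```latex
% Gronwall's inequality (integral form), generic symbols u, C, kappa.
\text{Let } u : [t_0, t_0 + T] \to [0,\infty) \text{ be continuous and } C, \kappa \ge 0. \text{ If}
\quad u(t) \le C + \kappa \int_{t_0}^{t} u(s)\, ds \quad \forall\, t \in [t_0, t_0 + T],
\text{then}
\quad u(t) \le C\, e^{\kappa (t - t_0)} \le C\, e^{\kappa T}.
% Assumed mapping for the bound on \|\theta^{t(n)}(t)\|:
%   u(t) = \|\theta^{t(n)}(t)\|, \quad t_0 = t(n),
%   C = C_0 + (M + L^{(1)}\|\lambda(0)\|)T, \quad \kappa = L^{(1)}(K+1).
```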

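Lemma 3.8. Suppose $\mu^n(\cdot) \to \mu^\infty(\cdot) \in \mathcal{U}$. Let $\theta^n(\cdot), n = 1, 2, \dots, \infty$ denote solutions to (19) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu^n(\cdot)$, for $n = 1, 2, \dots, \infty$. Suppose $\theta^n(0) \to \theta^\infty(0)$. Then

$$\lim_{n \to \infty} \sup_{t \in [0,T]} \|\theta^n(t) - \theta^\infty(t)\| = 0$$

for every $T > 0$.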

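Proof 11. It is shown in Lemma 2.3 of [20] that

$$\int_0^t \int \tilde f(s,z)\, \mu^n(s, dz)\, ds - \int_0^t \int \tilde f(s,z)\, \mu^\infty(s, dz)\, ds \to 0$$

for any $\tilde f \in C([0,T] \times S^{(1)})$. Using this, one can see that

$$\Big\| \int_0^t \big( \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^n(s)) - \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^\infty(s)) \big)\, ds \Big\| \to 0.$$

This follows because $\lambda$ is continuous and $h$ is jointly continuous in its arguments. As a function of $t$, the integral on the left is equicontinuous and pointwise bounded. By the Arzela-Ascoli theorem, this convergence must in fact be uniform for $t$ in a compact set. Now for $t > 0$,

$$\begin{aligned}
\|\theta^n(t) - \theta^\infty(t)\| &\le \|\theta^n(0) - \theta^\infty(0)\| + \Big\| \int_0^t \big( \tilde h(\theta^n(s), \lambda(\theta^n(s)), \mu^n(s)) - \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^\infty(s)) \big)\, ds \Big\| \\
&\le \|\theta^n(0) - \theta^\infty(0)\| + \int_0^t \|\tilde h(\theta^n(s), \lambda(\theta^n(s)), \mu^n(s)) - \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^n(s))\|\, ds \\
&\quad + \Big\| \int_0^t \big( \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^n(s)) - \tilde h(\theta^\infty(s), \lambda(\theta^\infty(s)), \mu^\infty(s)) \big)\, ds \Big\|.
\end{aligned}$$

Now, using the fact that $\lambda$ is Lipschitz with constant $K$, the remaining part of the proof follows in the same manner as Lemma 2.3 of [20].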

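Note that Lemma 3.8 shows that every limit point $(\tilde\mu(\cdot), \tilde\theta(\cdot))$ of $(\mu(s + \cdot), \bar\theta(s + \cdot))$ as $s \to \infty$ is such that $\tilde\theta(\cdot)$ satisfies (19) with $\mu(\cdot) = \tilde\mu(\cdot)$.

Lemma 3.9. Almost surely every limit point of $(\mu(s + \cdot), \bar\theta(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde\mu(\cdot), \tilde\theta(\cdot))$, where $\tilde\mu(\cdot)$ satisfies $\tilde\mu(t) \in D^{(1)}(\tilde\theta(t), \lambda(\tilde\theta(t)))$.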

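Proof 12. Suppose that $u(n) \uparrow \infty$, $\mu(u(n) + \cdot) \to \tilde\mu(\cdot)$ and $\bar\theta(u(n) + \cdot) \to \tilde\theta(\cdot)$. Let $\{f_i\}$ be countable dense in the unit ball of $C(S^{(1)})$, hence it is a separating class, i.e., $\int f_i\, d\mu = \int f_i\, d\nu$ for all $i$ implies $\mu = \nu$. For each $i$,

$$\zeta^i_n = \sum_{m=0}^{n-1} a(m) \Big( f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m) \Big)$$

is a zero-mean martingale with respect to $\mathcal{F}_n = \sigma(\theta_m, w_m, Z^{(1)}_m, A^{(1)}_m, m \le n)$, $n \ge 1$. Moreover, it is a square-integrable martingale due to the fact that the $f_i$'s are bounded and each $\zeta^i_n$ is a finite sum. Its quadratic variation process

$$A_n = \sum_{m=0}^{n-1} a(m)^2\, E\Big[ \Big( f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m) \Big)^2 \,\Big|\, \mathcal{F}_m \Big] + E[(\zeta^i_0)^2]$$

is almost surely convergent. By the martingale convergence theorem, $\{\zeta^i_n\}$ converges a.s. Let $\tau(n,t) = \min\{m \ge n : t(m) \ge t(n) + t\}$ for $t \ge 0$, $n \ge 0$. Then as $n \to \infty$,

$$\sum_{m=n}^{\tau(n,t)} a(m) \Big( f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m) \Big) \to 0, \quad \text{a.s.},$$

for $t > 0$. By our choice of $\{f_i\}$ and the fact that $\{a(n)\}$ are eventually non-increasing,

$$\sum_{m=n}^{\tau(n,t)} (a(m) - a(m+1))\, f_i(Z^{(1)}_{m+1}) \to 0, \quad \text{a.s.}$$

Thus,

$$\sum_{m=n}^{\tau(n,t)} a(m) \Big( f_i(Z^{(1)}_m) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m) \Big) \to 0, \quad \text{a.s.},$$

which implies

$$\int_{t(n)}^{t(n)+t} \int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\hat\theta(s),\hat w(s)) \Big)\, \mu(s, dz\,da)\, ds \to 0, \quad \text{a.s.}$$

Recall that $u(n)$ can be any general sequence other than $t(n)$. Therefore

$$\int_{u(n)}^{u(n)+t} \int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\hat\theta(s),\hat w(s)) \Big)\, \mu(s, dz\,da)\, ds \to 0, \quad \text{a.s.}$$

(this follows from the fact that $a(n) \to 0$ and the $f_i$'s are bounded), where $\hat\theta(s) = \theta_n$ and $\hat w(s) = w_n$ when $s \in [t(n), t(n+1))$, $n \ge 0$. Now, one can claim from the above that

$$\int_{u(n)}^{u(n)+t} \int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\bar\theta(s),\lambda(\bar\theta(s))) \Big)\, \mu(s, dz\,da)\, ds \to 0, \quad \text{a.s.}$$

This is due to the fact that the map $S^{(1)} \times U^{(1)} \times \mathbb{R}^{d+k} \ni (z,a,\theta,w) \to \int f_i(y)\, p(dy \mid z,a,\theta,w)$ is continuous and hence uniformly continuous on the compact set $A = S^{(1)} \times U^{(1)} \times M_1 \times M_2$, where $M_1$ is the compact set such that $\theta_n \in M_1$ for all $n$ and $M_2 = \{w : \|w\| \le \max(\sup_n \|w_n\|, K')\}$, where $K'$ is the bound for the compact set $\lambda(M_1)$. Here we also use the fact that $\|w_m - \lambda(\bar\theta(s))\| \to 0$ for $s \in [t_m, t_{m+1})$, as $\lambda$ is Lipschitz and $\|w_m - \lambda(\theta_m)\| \to 0$. The above convergence is equivalent to

$$\int_0^t \int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\bar\theta(s+u(n)),\lambda(\bar\theta(s+u(n)))) \Big)\, \mu(s+u(n), dz\,da)\, ds \to 0 \quad \text{a.s.}$$

Fix a sample point in the probability one set on which the convergence above holds for all $i$. Then the convergence above leads to

$$\int_0^t \int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\tilde\theta(s),\lambda(\tilde\theta(s))) \Big)\, \tilde\mu(s, dz\,da)\, ds = 0 \quad \forall i. \tag{20}$$

For showing the above, we use one part of the proof of Lemma 2.3 of [20]: if $\mu^n(\cdot) \to \mu^\infty(\cdot) \in \mathcal{U}$, then for any $t$,

$$\int_0^t \int f(s,z,a)\, \mu^n(s, dz\,da)\, ds - \int_0^t \int f(s,z,a)\, \mu^\infty(s, dz\,da)\, ds \to 0$$

for all $f \in C([0,t] \times S^{(1)} \times U^{(1)})$. In addition, we make use of the fact that $f_n(s,z,a) = \int f_i(y)\, p(dy \mid z,a,\bar\theta(s+u(n)),\lambda(\bar\theta(s+u(n))))$ converges uniformly to $f(s,z,a) = \int f_i(y)\, p(dy \mid z,a,\tilde\theta(s),\lambda(\tilde\theta(s)))$. To prove this, define $g : C([0,t]) \times [0,t] \times S^{(1)} \times U^{(1)} \to \mathbb{R}$ by $g(\theta(\cdot),s,z,a) = \int f_i(y)\, p(dy \mid z,a,\theta(s),\lambda(\theta(s)))$. Let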

A' = {0(u('n.) + •)l[u(n),u(n)+t]) ^ ^ l}U6>(-)l[o,t]- Using the same argument as in Lemma l^TB and (A6), i.e., X is Lip¬ 
schitz (the latter helps to claim that if Oni') —>■ uniformly then A(0„(-)) —>■ X{9{-)) uniformly), it can be seen that 

g is continuous. Then A' is compact as it is a union of a sequence of functions and its limit. So, 5|(A'x[o,t]xS<i)x(7(i)) 
is uniformly continuous. Then a similar argument as in Lemma \2.).\ shows equicontinuity 0 /{/„(.,.)} that results in 
uniform convergence and thereby f20f) . An application of Lebesgue’s theorem in conjunction with i20\) shows that 


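$$\int \Big( f_i(z) - \int f_i(y)\, p(dy \mid z,a,\tilde\theta(t),\lambda(\tilde\theta(t))) \Big)\, \tilde\mu(t, dz\,da) = 0 \quad \forall i$$

for a.e. $t$. By our choice of $\{f_i\}$, this leads to

$$\tilde\mu(t, dy \times U^{(1)}) = \int p(dy \mid z,a,\tilde\theta(t),\lambda(\tilde\theta(t)))\, \tilde\mu(t, dz\,da),$$

a.e. $t$.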

Lemma 13^ shows that every limit point {)!{■), 9{-)) of {fi{s + .), 9{s + .)) as s —>■ 00 is such that 0{-) satisfies (fTOl) 
with fi{-) = Hence, 9{-) is absolutely continuous. Moreover, using Lemma 1531 one can see that it satisfies dH]) 
a.e. t, hence is a solution to the differential inclusion (jH]). 


Proof 13 (Proof of Theorem l2.6l and l2.7p . From the previous three lemmas it is easy to see that Aq = nt>o{0(s) : s > t} 
is almost everywhere an internally chain transitive set of m- 

Proof 14 (Proof of Corollary [T]) . Follows directly from Theorem, \2 . 61 and Lemma \2.1[ 

4 Discussion on the assumptions: Relaxation of (A2) 

We discuss relaxing the uniformity of the Lipschitz constant of the vector field w.r.t. the state of the controlled Markov process. The modified assumption here is

(A2)' $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments with the third argument fixed to the same value, and the Lipschitz constant is a function of this value. The latter condition means that

$$\forall z^{(1)} \in S^{(1)}, \quad \|h(\theta,w,z^{(1)}) - h(\theta',w',z^{(1)})\| \le L^{(1)}(z^{(1)})\,(\|\theta - \theta'\| + \|w - w'\|).$$

A similar condition holds for $g$, where the Lipschitz constant is $L^{(2)} : S^{(2)} \to \mathbb{R}^+$.

Note that this allows $L^{(i)}(\cdot)$ to be an unbounded measurable function, making it discontinuous due to (A1). The straightforward solution for implementing this is to additionally assume the following:

(A8) $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ a.s.

still allowing $L^{(i)}(\cdot)$ to be an unbounded function. As all our proofs in Section 3 are shown for every sample point of a probability 1 set, our proofs will go through. In the following we give such an example for the case where the Markov process is uncontrolled.

It is enough to consider examples with $S^{(i)}$ locally compact (because then we can take the standard one-point compactification and define $L^{(i)}$ arbitrarily at the extra point).

Let $S^{(i)} = \mathbb{Z}$ and let $Z^{(i)}_n, n \ge 0$ be the Markov chain on $\mathbb{Z}$ starting at 0 with transition probabilities $p(n, n+1) = p$ and $p(n, n-1) = 1 - p$. We assume $1/2 < p < 1$. Let $L^{(i)}(n) =$

Note that $Z^{(i)}_n, n \ge 0$ is a transient Markov chain with $Z^{(i)}_n \to +\infty$ a.s. From this it follows that $\inf_n Z^{(i)}_n > -\infty$, and thus $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ almost surely. It follows that $L^{(i)}(Z^{(i)}_n)$ is a bounded sequence with probability 1, but this bound is clearly not deterministic since there is a non-zero probability that the sample path reaches large negative values.
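
The behaviour described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names are hypothetical, and since the paper's definition of $L^{(i)}(n)$ is not legible in this copy, the code uses the made-up choice $L(z) = 2^{-z}$, which is bounded on every half-line $[c, \infty)$ and unbounded as $z \to -\infty$.

```python
import random


def simulate_walk(p, steps, seed):
    """Simulate the nearest-neighbour walk on the integers started at 0.

    Each step is +1 with probability p and -1 with probability 1 - p.
    Returns (final position, running minimum of the path).
    """
    rng = random.Random(seed)
    position, running_min = 0, 0
    for _ in range(steps):
        position += 1 if rng.random() < p else -1
        running_min = min(running_min, position)
    return position, running_min


def lipschitz_constant(z):
    # Hypothetical choice (not from the paper): unbounded as z -> -infinity.
    return 2.0 ** (-z)


if __name__ == "__main__":
    for seed in range(5):
        final, lowest = simulate_walk(p=0.7, steps=10_000, seed=seed)
        # L is decreasing, so sup_n L(Z_n) is attained at the running minimum.
        print(seed, final, lowest, lipschitz_constant(lowest))
```

Each sample path yields a finite value of the supremum, but the value changes from path to path, which is the sense in which the bound is random rather than deterministic.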

However in the following we discuss on the idea of using moment assumptions to analyze the convergence of single 
timescale controlled Markov noise framework of [20]. We show that the iterates ([13 (with Cn = 0) converge to an 
internally chain transitive set of the o. d.e. m- For this we prove Lemma l3.II under the following assumptions: For 
all T > 2,f = 1,2, 

(S1) The controlled Markov process $\{Y_n\}$ as described in [20] takes values in a compact metric space.

(S2) For all $n \ge 0$, $0 < a(n) \le 1$, $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$ and $a(n+1) \le a(n)$.

(S3) $h$ is Lipschitz in its first argument for each fixed value of the second, with a Lipschitz constant depending on that value. The condition means that

$$\forall z \in S, \quad \|h(\theta,z) - h(\theta',z)\| \le L(z)\, \|\theta - \theta'\|.$$

(S4) Let $\phi(n,T) = \max(m : a(n) + a(n+1) + \dots + a(n+m) \le T)$ with the bound depending on $T$. Then


sup A 

n 


16 


sup 

. 0<m<^(n,T) j 


< OO. 


(S5) 


sup A 


a{n+m)L(Yn+m.) 


< OO. 


Note that (S4) and (S5) are trivially satisfied in the case when L{z) = L for all z S 5 i.e. the case of Section 

H 


Remark 9. As long as one can prove Lemma \3.1\ for all T > 2 it will hold for all T > 0, thus one can combine 
(S4) and (S5) into the following assumption: 


sup A 

n 




< OO. 


As an instance where such an assumption is verified, consider the Markov process of m Egn. (3-4)] defined 
by 

W+I = A(0„)W + B{0n)Wn+l 

where A{0),B{0), 0 G are k x k-matrices and (IF„)„>o o.re independent and identically distributed R'^-valued 
random variables. Assume that the following conditions hold true for all x,y G S: 

(a) L(Yn) is a non-decreasing seguence. 

(b) For r > 0, i? > 0, 

sup e^L(A(6)x+B{e)v) < A 

\\e\\<R 

for some Cr, Mr, Lr > 0 and or < 1. 

Then 


E 

< 




< LRaRX'-^^^^ + MrE 


,CrL(W„) 


=LRaRX 


r rL(x) 


Kr, 
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with Kr = MrE (this follows from the fact that Wn are i.i.d if we assume that E < oo). 

Choosing large values of r, one can show that 


E 


erEY.)\Y^_^ = x,0n-i = d + Kr 


where (3r = Lrur^ < 1. Using the above, for large r 


E 


:,rHY„) 


= E 


E 


which shows that 


Choosing r > 8T, 


CCYn)^Y„.uOn-l]] <(3 rE 
sup E 


^rL(F„_i) 


Kf 


„rL(Yn) 


sup E 


„8TL{Y„) 


< OO. 


< OO. 


Note that this is a much weaker assumption than (A8).

(S6) The noise sequence Mn,n > 0 (need not be a martingale difference sequence) satisfies the following condition 


sup-E 


' rt>{n,T) 

||fbfn+m+l| 

m=0 


< oo. 


(S7) $\sup_n \|\theta_n\| < \infty$.

With the above assumptions we prove the following tracking lemma:

Lemma 4.1. For any $T > 0$, $\sup_{t \in [s, s+T]} \|\bar\theta(t) - \theta^s(t)\| \to 0$, a.s.

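Proof 15. Let $t(n) \le t \le t(n+m)$. Now, if $0 \le k \le (m-1)$ and $t \in (t(n+k), t(n+k+1)]$,

$$\begin{aligned}
\|\theta^{t(n)}(t)\| &\le \|\bar\theta(t(n))\| + \Big\| \int_{t(n)}^{t} \tilde h(\theta^{t(n)}(\tau), \mu(\tau))\, d\tau \Big\| \\
&\le \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \big( \|h(0, Y_{n+l})\| + L(Y_{n+l})\, \|\theta^{t(n)}(\tau)\| \big)\, d\tau \\
&\quad + \int_{t(n+k)}^{t} \big( \|h(0, Y_{n+k})\| + L(Y_{n+k})\, \|\theta^{t(n)}(\tau)\| \big)\, d\tau \\
&\le C_0 + MT + \int_{t(n)}^{t} L(Y(\tau))\, \|\theta^{t(n)}(\tau)\|\, d\tau,
\end{aligned}$$

where $Y(\tau) = Y_n$ if $\tau \in [t(n), t(n+1))$. Then it follows from an application of the Gronwall inequality that

$$\|\theta^{t(n)}(t)\| \le C\, e^{\int_{t(n)}^{t} L(Y(\tau))\, d\tau} \quad \text{a.e. } t,$$

where $C = C_0 + MT$. Next,

$$\begin{aligned}
\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| &\le \int_{t(n+k)}^{t} \|h(\theta^{t(n)}(s), Y_{n+k})\|\, ds \\
&\le \|h(0, Y_{n+k})\|\,(t - t(n+k)) + L(Y_{n+k}) \int_{t(n+k)}^{t} \|\theta^{t(n)}(s)\|\, ds \\
&\le M a(n+k) + C L(Y_{n+k}) \int_{t(n+k)}^{t} e^{\int_{t(n)}^{s} L(Y(\tau))\, d\tau}\, ds.
\end{aligned}$$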
J t{n-\-k) 
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Then 


where 




f t{Y] 


{h{e^^-\t),p.{i^)-h{e^^-\[t]),m)))dt\\ 


Mn+k+l) 

<E / \\h{ed'^\t),Yr,+u)-h{ed-\[t]),Yr,+u)\\dt 

k=0 dt(n+k) 

m-1 „t(n+k+l) 

< E Myn+k) / - ed^Htin + k))\\dt 

J,_Q Jt{n+k) 

m — 1 

< E 


Ck = L(y„+fc)a(n + kf \m + CL(y„+fc)e^'=« 


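$$\begin{aligned}
\|\bar\theta(t(n+m)) - \theta^{t(n)}(t(n+m))\| &\le \sum_{k=0}^{m-1} L(Y_{n+k})\, a(n+k)\, \|\bar\theta(t(n+k)) - \theta^{t(n)}(t(n+k))\| \\
&\quad + \sum_{k=0}^{m-1} c_k + \|\delta_{n,n+m}\|,
\end{aligned}$$

where $\delta_{n,n+m} = \sum_{k=n}^{n+m-1} a(k) M_{k+1}$.

Therefore using the discrete Gronwall inequality we get

$$\|\bar\theta(t(n+m)) - \theta^{t(n)}(t(n+m))\| \le r(m,n)\, e^{\sum_{k=0}^{m-1} a(n+k) L(Y_{n+k})},$$

where $r(m,n) = \sum_{k=0}^{m-1} \big( c_k + a(n+k)\, \|M_{n+k+1}\| \big)$.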
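
The discrete Gronwall step can be checked numerically on its extremal case. The sketch below is an illustration under stated assumptions, not the paper's argument: `extremal_sequence` and `gronwall_bound` are hypothetical helper names, `r` stands for a constant forcing term and `b[k]` for the non-negative weights (here playing the role of $a(n+k) L(Y_{n+k})$). The sequence satisfying $x_m = r + \sum_{k<m} b_k x_k$ with equality equals $r \prod_{k<m} (1 + b_k)$, which never exceeds $r \exp(\sum_{k<m} b_k)$.

```python
import math


def extremal_sequence(r, b):
    """Return x_0, ..., x_len(b) with x_m = r + sum_{k<m} b[k] * x_k (equality case)."""
    xs = []
    weighted_sum = 0.0
    for m in range(len(b) + 1):
        x_m = r + weighted_sum
        xs.append(x_m)
        if m < len(b):
            weighted_sum += b[m] * x_m
    return xs


def gronwall_bound(r, b, m):
    """Discrete Gronwall bound r * exp(sum_{k<m} b[k])."""
    return r * math.exp(sum(b[:m]))


if __name__ == "__main__":
    weights = [0.5 / (k + 1) for k in range(50)]  # made-up non-negative weights
    xs = extremal_sequence(2.0, weights)
    for m in (0, 10, 50):
        print(m, xs[m], gronwall_bound(2.0, weights, m))
```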
Now, for some A G [0,1], 


\\od-\t)-m\\ 

< (1 - A)||6»‘(”)(i(n + m + 1)) - 0(t(n + m + 1)) + A||6»‘(”)(i(n + m)) — 9{t{n + rn))\\ 

/t(n+m+l) 

||/i(0*^"^(s),/i(s))||(is 


> t{n-\-m) 


< r{m + 1, n)e^'==o -|- a(n + m) M + CL{Yn+m)e^’‘=° a{n+k)L(Yn+k) 


Therefore 


p{n,T) ■= sup \\9d'^\t) — 9{f)\\ <r{(j){n,T+ l),n)e^'^=° °-i'^+k)L{Y„+k) 

t<^[t{n) ,t{n)+T] 


+ a{n) 


M + C sup a{n+k)L{Y„+k) 

0<m<(p{n,T) 


Now to prove the a.s. convergence of the quantity in the left hand side as n ^ oo, we have using Cauchy-Schwartz 
inequality: 


n—1 


n—1 


E E[pin, Tf] <2 Kt E {e [(rWn, T + 1), n))"] ) + 4M^ E + 


n=0 


4C'2 E a{nfE 


n—1 


sup L(Yn+m) e 

. 0<m<(^(n,T) j 


2Et=o''^’a(n+fc)L(r„+,) 


where Kt 


\j SUp„ E[e^ a{n+k)L(Y„+k^ 


which depends only on T due to (S5). 


Now, the third term in the 
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R.H.S is clearly finite from the assumptions (Sf) and (S5). Now we analyze the first term i.e. 


n—1 


r{(l){n,T+ l),nf ^ <2^2^ 


n—1 


E 


1/2 


+ 2^2^ 


Cfc 


^0(n,T) 

E I ^ a{n + k)\\Mn+k+i\ 


V 


1/2 


( 21 ) 


Next we analyze the first term in the R.H.S of i21\) again using Cauchy-Schwartz ineguality: 


( 


E 

n—1 


E 


V 


E 

fc=0 


1/2 




n—1 


sup L{Yn+k) 

\p<k<4>{n,T) J 


1/2 


8C2^<^(n,r)2a(n)4 E 


( sup L(r„+fc)) aiu+^)LiY„+,} 

\0<fc<</(n,T) J 


1/2 


Therefore the R.H.S. will be finite if we can show that $\sum_{n=1}^{\infty} \phi(n,T)^2 a(n)^4$ is finite. For common step-size sequence

a{n) = (j){n,T) = 0(n) thus the above series converges clearly. One can make the series converge for all a{n) = ^ 

with ^ < k < 1 by putting assumptions on higher moments in (Sf) and (S5) . 
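
To get a feel for how $\phi(n,T)$ grows, the following sketch computes it directly from the definition in (S4). The step-size sequence $a(n) = 1/(n+1)$ is an assumption made here for illustration (it satisfies (S2)), and the function names are hypothetical. For this choice the ratio $\phi(n,T)/n$ settles near $e^T - 1$, i.e., $\phi(n,T)$ grows linearly in $n$.

```python
def step_size(n):
    # Illustrative choice a(n) = 1/(n+1), so that 0 < a(n) <= 1 for n >= 0.
    return 1.0 / (n + 1)


def phi(n, T, a=step_size):
    """Largest m >= 0 with a(n) + ... + a(n+m) <= T, or -1 if a(n) > T.

    Terminates whenever the step sizes are not summable.
    """
    total, m = 0.0, -1
    while True:
        total += a(n + m + 1)
        if total > T:
            return m
        m += 1


def partial_series(N, T):
    """Partial sum of phi(n, T)^2 * a(n)^4 for n = 1, ..., N."""
    return sum(phi(n, T) ** 2 * step_size(n) ** 4 for n in range(1, N + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, phi(n, 1.0), phi(n, 1.0) / n)
    print(partial_series(200, 1.0), partial_series(400, 1.0))
```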

In the above we have used the following ineguality repeatedly for non-negative random variables X and Y: 


E 


{X + Yf 


2n-l 

<22 




with n G N. 
Now, 


E 

n=l 


E 


V 


'4>{n,T) Y 

a{n + k)\\Mn+k+i\\\ 

k=o J 


1/2 


/ 


E 


< '^afnf 

n—1 

which is finite under assumption (S5) and the fact that a(n) are non-increasing. 


V 




)■]) 


1/2 


5 Application: Off-policy temporal difference learning with linear function approximation

In this section, we present an application of our results in the setting of off-policy temporal difference learning with linear function approximation. In this framework, we need to estimate the value function for a target policy $\pi$ given the continuing evolution of the underlying MDP (with finite state and action spaces $S$ and $A$ respectively, specified by expected reward $r(\cdot,\cdot,\cdot)$ and transition probability kernel $p(\cdot|\cdot,\cdot)$) for a behaviour policy $\pi_b$, with $\pi \neq \pi_b$. The

authors of munis] have proposed two approaches to solve the problem: 

(i) Sub-sampling: In this approach, the transitions which are relevant to the deterministic target policy are kept and the rest of the data is discarded from the given “on-policy” trajectory. We use the triplet $(S, R, S')$ to represent (current state, reward, next state). Therefore one has “off-policy” data $(X'_n, R_n, W_n), n \ge 0$ where $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the target policy, and $X'_n, n \ge 0$ is a random process generated by sampling the “on-policy” trajectory at increasing stopping times.

(ii) Importance-weighting: In this approach, unlike sub-sampling, all the data from the given “on-policy” trajectory is used. One advantage of this method is that both the behaviour and the target policies may be randomized, unlike the sub-sampling scenario where only a deterministic policy can be used as the target policy.
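
The difference between the two approaches can be sketched in a few lines. The data layout (tuples of state, action, reward, next state) and the function names are assumptions made for this illustration only; they are not taken from the cited works.

```python
def sub_sample(trajectory, target_policy):
    """Keep only transitions whose action matches the deterministic target policy.

    trajectory: list of (state, action, reward, next_state) tuples
    target_policy: dict mapping state -> action
    Returns a list of (state, reward, next_state) triplets.
    """
    return [(s, r, s_next)
            for (s, a, r, s_next) in trajectory
            if target_policy[s] == a]


def importance_weights(trajectory, target_probs, behaviour_probs):
    """Per-transition ratios pi(a|s) / pi_b(a|s); every transition is kept."""
    return [target_probs[s][a] / behaviour_probs[s][a]
            for (s, a, _, _) in trajectory]


if __name__ == "__main__":
    traj = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 2.0, 0)]
    print(sub_sample(traj, {0: 0, 1: 0}))
    uniform = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
    greedy = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
    print(importance_weights(traj, greedy, uniform))
```

Sub-sampling shortens the data set, whereas importance weighting keeps its length and attaches a weight to each transition (possibly zero).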

Then they introduce gradient temporal difference learning algorithms (GTD) [11] \T2\ [3] for both the approaches. 

Currently, all GTD algorithms make the assumption that data is available in the “off-policy” setting, i.e., of the form $(X'_n, R_n, W_n), n \ge 0$ where $\{X'_n\}$ are i.i.d., $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the deterministic target policy. Additionally, $\{X'_n\}$ is assumed to be sampled according to the stationary distribution of the Markov chain corresponding to the behaviour policy. However, such data cannot be generated by sub-sampling given only the “on-policy” trajectory. The reason is that a Markov chain sampled at increasing stopping times cannot be i.i.d. In the following, we show how gradient temporal-difference learning along with importance weighting can be used to solve the off-policy convergence problem stated above for TD when only the “on-policy” trajectory is available.


5.1 Problem Definition 

Suppose we are given an on-policy trajectory $(X_n, A_n, R_n, X_{n+1}), n \ge 0$ where $\{X_n\}$ is a time-homogeneous irreducible Markov chain with unique stationary distribution $\nu$, generated from a behavior policy $\pi_b \neq \pi$. Here the quadruplet $(S, A, R, S')$ represents (current state, action, reward, next state). Also, assume that $\pi_b(a|s) > 0$ for all $s \in S, a \in A$. We need to find the solution $\theta^*$ for the following:

$$0 = \sum_{s,a,s'} \nu(s)\, \pi(a|s)\, p(s'|s,a)\, \delta(\theta; s,a,s')\, \phi(s) = b - A\theta, \tag{22}$$

where

(i) $\theta \in \mathbb{R}^d$ is the parameter for the value function,

(ii) $\phi : S \to \mathbb{R}^d$ is a vector of state features,

(iii) $X \sim \nu$,

(iv) $0 < \gamma < 1$ is the discount factor,

(v) $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A} \pi_b(a|s)\, r(s,a,s')$,

(vi) $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A} \pi_b(a|s)\, p(s'|s,a)$,

(vii) $\delta(\theta; s,a,s') = r(s,a,s') + \gamma \theta^T \phi(s') - \theta^T \phi(s)$ is the temporal difference term with expected reward,

(viii) $\rho_{X,A_n} = \frac{\pi(A_n|X)}{\pi_b(A_n|X)}$,

(ix) $\delta_{X,R_n,X_{n+1}} = R_n + \gamma \theta^T \phi(X_{n+1}) - \theta^T \phi(X)$ is the online temporal difference,

(x) $A = E[\rho_{X,A_n}\, \phi(X)(\phi(X) - \gamma \phi(X_{n+1}))^T]$,

(xi) $b = E[\rho_{X,A_n}\, R_n\, \phi(X)]$.
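
The quantities $A$, $b$ and the solution of $b - A\theta = 0$ can be computed exactly for a toy problem. The sketch below is an illustration under stated assumptions: the two-state, two-action MDP, both policies, and the scalar feature ($d = 1$) are all made up, and the function names are hypothetical. It forms $A$ and $b$ as expectations under $\nu$ and $\pi_b$ reweighted by $\rho$, then checks that the resulting $\theta^*$ zeroes the expected update taken directly under the target policy $\pi$.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical two-state, two-action MDP (all numbers are made up).
P = {  # P[s][a][s_next] = p(s_next | s, a)
    0: {0: (0.9, 0.1), 1: (0.2, 0.8)},
    1: {0: (0.7, 0.3), 1: (0.1, 0.9)},
}
R = {  # R[s][a][s_next] = r(s, a, s_next)
    0: {0: (0.0, 1.0), 1: (0.5, 2.0)},
    1: {0: (1.0, 0.0), 1: (0.0, 3.0)},
}
PI_B = {0: (0.5, 0.5), 1: (0.5, 0.5)}  # behaviour policy, pi_b(a|s) > 0
PI = {0: (0.8, 0.2), 1: (0.3, 0.7)}    # target policy pi(a|s)
PHI = (1.0, 2.0)                        # scalar feature phi(s), so d = 1


def stationary_distribution():
    """Stationary law nu of the two-state chain induced by pi_b."""
    p01 = sum(PI_B[0][a] * P[0][a][1] for a in ACTIONS)
    p10 = sum(PI_B[1][a] * P[1][a][0] for a in ACTIONS)
    return (p10 / (p01 + p10), p01 / (p01 + p10))


def solve_td_fixed_point():
    """Return (A, b, theta_star); expectations under nu and pi_b, reweighted by rho."""
    nu = stationary_distribution()
    A = b = 0.0
    for s in STATES:
        for a in ACTIONS:
            rho = PI[s][a] / PI_B[s][a]
            for s_next in STATES:
                weight = nu[s] * PI_B[s][a] * P[s][a][s_next] * rho
                A += weight * PHI[s] * (PHI[s] - GAMMA * PHI[s_next])
                b += weight * R[s][a][s_next] * PHI[s]
    return A, b, b / A


def expected_td_update(theta):
    """sum over (s, a, s') of nu(s) pi(a|s) p(s'|s,a) delta(theta; s,a,s') phi(s)."""
    nu = stationary_distribution()
    total = 0.0
    for s in STATES:
        for a in ACTIONS:
            for s_next in STATES:
                delta = (R[s][a][s_next] + GAMMA * theta * PHI[s_next]
                         - theta * PHI[s])
                total += nu[s] * PI[s][a] * P[s][a][s_next] * delta * PHI[s]
    return total


if __name__ == "__main__":
    A, b, theta_star = solve_td_fixed_point()
    print(A, b, theta_star, expected_td_update(theta_star))
```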

Hence the desired approximate value function under the target policy $\pi$ is $V_{\theta^*} = \theta^{*T}\phi$. Let $V_\theta = \theta^T \phi$. It is well-known ([3]) that $\theta^*$ satisfies the projected fixed point equation, namely

$$V_\theta = \Pi_{\mathcal{G},\nu} T^\pi V_\theta,$$

where

$$\Pi_{\mathcal{G},\nu} \hat V = \arg\min_{f \in \mathcal{G}} \|\hat V - f\|^2_\nu,$$

with $\mathcal{G} = \{V_\theta \mid \theta \in \mathbb{R}^d\}$ and the Bellman operator

$$T^\pi V_\theta(i) = \sum_{j \in S} \sum_{a \in A} \pi(a|i)\, p(j|i,a)\, [\gamma V_\theta(j) + r(i,a,j)].$$

Therefore to find $\theta^*$, the idea is to minimize the mean square projected Bellman error $J(\theta) = \|V_\theta - \Pi_{\mathcal{G},\nu} T^\pi V_\theta\|^2_\nu$ using stochastic gradient descent. It can be shown that the expression of the gradient contains a product of multiple expectations. Such a framework can be modelled by two time-scale stochastic approximation, where one iterate stores the quasi-stationary estimates of some of the expectations and the other iterate is used for sampling.

5.2 The TDC Algorithm with importance-weighting

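We consider the TDC (Temporal Difference with Correction) algorithm with importance-weighting from Sections 4.2 and 5.2 of [3]. The gradient in this case can be shown to satisfy

$$-\frac{1}{2}\nabla J(\theta) = E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)] - \gamma E[\rho_{X,A_n}\, \phi(X_{n+1})\, \phi(X)^T]\, w(\theta),$$
$$w(\theta) = E[\phi(X)\phi(X)^T]^{-1} E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)].$$

Define $\phi_n = \phi(X_n)$, $\phi'_n = \phi(X_{n+1})$, $\delta_n(\theta) = \delta_{X_n,R_n,X_{n+1}}(\theta)$ and $\rho_n = \rho_{X_n,A_n}$. Therefore the associated iterations in this algorithm are:

$$\begin{aligned}
\theta_{n+1} &= \theta_n + a(n)\, \rho_n \big[ \delta_n(\theta_n)\phi_n - \gamma \phi'_n \phi_n^T w_n \big], \\
w_{n+1} &= w_n + b(n) \big[ (\rho_n \delta_n(\theta_n) - \phi_n^T w_n)\phi_n \big],
\end{aligned} \tag{23}$$

with $\{a(n)\}, \{b(n)\}$ satisfying (A4).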
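
A single step of the iterations (23) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, vectors are plain Python lists, and the step-size pair $a(n) = 1/(n+1)$, $b(n) = (n+1)^{-2/3}$ is an assumed example in which $a(n)/b(n) \to 0$ (the precise requirements are those of (A4), which is stated elsewhere in the paper).

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, a_n, b_n):
    """One importance-weighted TDC update following (23).

    theta, w: current slow and fast iterates
    phi, phi_next: features of the current and next state
    rho: importance ratio pi(A_n | X_n) / pi_b(A_n | X_n)
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    # Both updates use the old values of theta and w.
    new_theta = [
        th + a_n * rho * (delta * f - gamma * f_next * phi_w)
        for th, f, f_next in zip(theta, phi, phi_next)
    ]
    new_w = [wi + b_n * (rho * delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w


def step_sizes(n):
    # Illustrative two time-scale choice: a(n) / b(n) -> 0 as n grows.
    return 1.0 / (n + 1), 1.0 / (n + 1) ** (2.0 / 3.0)


if __name__ == "__main__":
    theta, w = [0.0], [0.0]
    theta, w = tdc_step(theta, w, [1.0], [0.5], 1.0, 2.0, 0.9, 0.1, 0.5)
    print(theta, w)
```

The slow iterate $\theta_n$ moves with the smaller step $a(n)$, so that from its point of view the fast iterate $w_n$ has already settled near its equilibrium for the current $\theta_n$.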

5.3 Convergence Proof 

Theorem 5.1 (Convergence of TDC with importance-weighting). Consider the iterations (23) of the TDC. Assume the following:

(i) $\{a(n)\}, \{b(n)\}$ satisfy (A4).

(ii) $\{(X_n, R_n, X_{n+1}), n \ge 0\}$ is such that $\{X_n\}$ is a time-homogeneous finite state irreducible Markov chain generated from the behavior policy $\pi_b$, with unique stationary distribution $\nu$. $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A} \pi_b(a|s)\, r(s,a,s')$ and $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A} \pi_b(a|s)\, p(s'|s,a)$, where $\pi_b$ is the behaviour policy, $\pi \neq \pi_b$. Also, $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely, and

(iii) $C = E[\phi(X)\phi(X)^T]$ and $A = E[\rho_{X,A_n}\, \phi(X)(\phi(X) - \gamma\phi(X_{n+1}))^T]$ are non-singular, where $X \sim \nu$.

(iv) $\pi_b(a|s) > 0$ for all $s \in S, a \in A$.

(v) $\sup_n (\|\theta_n\| + \|w_n\|) < \infty$ w.p. 1.

Then the parameter vector $\theta_n$ converges with probability one as $n \to \infty$ to the TD(0) solution (22).

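Proof 16. The iterations (23) can be cast into the framework of Section 2.2 with

(i) $Z^{(i)}_n = X_{n-1}$,

(ii) $h(\theta,w,z) = E[\rho_n(\delta_n(\theta)\phi_n - \gamma\phi'_n\phi_n^T w) \mid X_{n-1} = z, \theta_n = \theta, w_n = w]$,

(iii) $g(\theta,w,z) = E[(\rho_n\delta_n(\theta) - \phi_n^T w)\phi_n \mid X_{n-1} = z, \theta_n = \theta, w_n = w]$,

(iv) $M^{(1)}_{n+1} = \rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) - E[\rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) \mid X_{n-1}, \theta_n, w_n]$,

(v) $M^{(2)}_{n+1} = (\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n - E[(\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n \mid X_{n-1}, \theta_n, w_n]$,

(vi) $\mathcal{F}_n = \sigma(\theta_m, w_m, R_{m-1}, X_{m-1}, A_{m-1}, m \le n, i = 1, 2)$, $n \ge 0$.

Note that in (ii) and (iii) we can define $h$ and $g$ independent of $n$ due to the time-homogeneity of $\{X_n\}$.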

Now, we verify the assumptions (A1)-(A7) (mentioned in Sections \2.2\ and \2.A} for our application: 

(i) (A1): $Z^{(i)}_n, \forall n, i = 1, 2$ takes values in a compact metric space as $\{X_n\}$ is a finite state Markov chain.

(ii) (A5): Continuity of the transition kernel follows trivially from the fact that we have a finite state MDP.

Remark 10. In fact we don’t have to verify this assumption for the special case when the Markov chain is 
uncontrolled and has unique stationary distribution. The reason is that in such case (A5) will be used only in 
the proof of Lem,m,a \2.A However, if the Markov chain has unique stationary distribution Lem,m,a \2.A trivially 
follows. 

(iii) (A2)

(a)

$$\begin{aligned}
&\|h(\theta,w,z) - h(\theta',w',z)\| \\
&= \big\| E\big[ \rho_n (\theta - \theta')^T (\gamma\phi(X_{n+1}) - \phi(X_n))\, \phi(X_n) - \gamma\rho_n\, \phi(X_{n+1})\phi(X_n)^T (w - w') \,\big|\, X_{n-1} = z \big] \big\| \\
&\le L\,(2\|\theta - \theta'\| M^2 + \|w - w'\| M^2),
\end{aligned}$$

where $M = \max_{s \in S} \|\phi(s)\|$ with $S$ being the state space of the MDP and $L = \max_{(s,a) \in (S \times A)} \frac{\pi(a|s)}{\pi_b(a|s)}$. Hence $h$ is Lipschitz continuous in the first two arguments uniformly w.r.t. the third. In the last inequality above, we use the Cauchy-Schwarz inequality.

(b) As with the case of $h$, $g$ can be shown to be Lipschitz continuous in the first two arguments uniformly w.r.t. the third.

(c) Joint continuity of $h$ and $g$ follows from (iii)(a) and (b) respectively as well as the finiteness of $S$.

(iv) (A3): Clearly, $\{M^{(i)}_{n+1}\}, i = 1, 2$ are martingale difference sequences w.r.t. the increasing $\sigma$-fields $\mathcal{F}_n$. Note that $E[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n] \le K(1 + \|\theta_n\|^2 + \|w_n\|^2)$ a.s., $n \ge 0$, since $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely and $S$ is finite.

(v) (A4): This follows from condition (i) in the statement of Theorem 5.1.

Now, one can see that the faster o.d.e. becomes

$$\dot w(t) = E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)] - E[\phi(X)\phi(X)^T]\, w(t).$$

Clearly, $C^{-1} E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)]$ is the globally asymptotically stable equilibrium of the o.d.e. Moreover, $V'(\theta,w) = \frac{1}{2}\|Cw - E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)]\|^2$ is continuously differentiable. Additionally, $\lambda(\theta) = C^{-1} E[\rho_{X,A_n}\, \delta_{X,R_n,X_{n+1}}(\theta)\, \phi(X)]$ and it is Lipschitz continuous in $\theta$, verifying (A6)'. For the slower o.d.e., the global attractor is $A^{-1} E[\rho_{X,A_n}\, R_n\, \phi(X)]$,
verifying the additional assumption in Corollary The attractor set here is a singleton. Also, (A7) is (v) in the 
statement of Theorem \5.1l Therefore the assumptions (Al) — (A5), (A6'), (A7) are verified. The proof would then 
follow from Corollary]^ 
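
The stability claim for the faster o.d.e. can be illustrated numerically for a frozen $\theta$. In the sketch below, which is only an illustration under stated assumptions, the $2 \times 2$ matrix `C` stands in for $E[\phi(X)\phi(X)^T]$ and the vector `u` for $E[\rho\,\delta(\theta)\,\phi(X)]$; both are made-up numbers, and the function names are hypothetical. A forward Euler scheme for $\dot w = u - Cw$ is compared with the closed-form equilibrium $C^{-1}u$.

```python
def integrate_fast_ode(C, u, w0, dt=0.01, steps=5000):
    """Forward Euler scheme for dw/dt = u - C w with a 2x2 matrix C and fixed u."""
    w = list(w0)
    for _ in range(steps):
        dw = [u[i] - (C[i][0] * w[0] + C[i][1] * w[1]) for i in range(2)]
        w = [w[i] + dt * dw[i] for i in range(2)]
    return w


def solve_2x2(C, u):
    """Closed-form solution of C x = u for a non-singular 2x2 matrix C."""
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [(C[1][1] * u[0] - C[0][1] * u[1]) / det,
            (C[0][0] * u[1] - C[1][0] * u[0]) / det]


if __name__ == "__main__":
    C = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive definite (made up)
    u = [1.0, -1.0]                # made-up right-hand side
    print(integrate_fast_ode(C, u, [5.0, -3.0]), solve_2x2(C, u))
```

Because `C` is positive definite, trajectories from any starting point approach the same equilibrium, which is the role $\lambda(\theta)$ plays in the analysis.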

Remark 11. The reason for using two time-scale framework for the TDC algorithm is to make sure that the o.d.e’s 
have globally asymptotically stable equilibrium. 

Remark 12. Because of the fact that the gradient is a product of two expectations the scheme is a “pseudo”-gradient 
descent which helps to find the global minimum here. 

Remark 13. Here we assume the stability of the iterates (23). Certain sufficient conditions have been sketched for showing stability of single timescale stochastic recursions with controlled Markov noise [21, p. 75, Theorem 9]. This subsequently needs to be extended to the case of two time-scale recursions.

Another way to ensure boundedness of the iterates is to use a projection operator. However, projection may 
introduce spurious fixed points on the boundary of the projection region and finding globally asymptotically stable 
equilibrium of a projected o.d.e. is hard. Therefore we do not use projection in our algorithm. 

Remark 14. Convergence analysis for TDC with importance weighting along with eligibility traces cf. P- '^4] 
where it is called GTD(X)can be done similarly using our results. The main advantage is that it works for A < -^ 

(X € [0,1] being the eligibility function) whereas the analysis in m is shown only for X very close to 1. 

Remark 15. One can analyze this algorithm when the state space is infinite by imposing assumptions on $\phi$ as well as the target and behavior policies.

6 Conclusion 

We presented a general framework for two time-scale stochastic approximation with controlled Markov noise. Moreover, using a special case of our results, i.e., when the random process is a finite state irreducible time-homogeneous Markov chain (hence has a unique stationary distribution) and uncontrolled (i.e., does not depend on the iterates), we provided a rigorous proof of convergence for the off-policy temporal difference learning algorithm with linear function approximation, under the assumption that only the “on-policy” trajectory for a behaviour policy is available. The proof is also extendible to eligibility traces (for a sufficiently large range of $\lambda$). To our knowledge, this has not been done previously.